
Enable EVEX support by default #83648

Merged (16 commits) Mar 25, 2023
Conversation

@tannergooding (Member)

This does some cleanup to enable EVEX support by default and to ensure that relevant CI jobs exist to test the non-AVX512 paths.

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label on Mar 19, 2023
@ghost assigned tannergooding on Mar 19, 2023
@ghost commented Mar 19, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


```diff
@@ -304,7 +304,6 @@ CONFIG_INTEGER(EnableMultiRegLocals, W("EnableMultiRegLocals"), 1) // Enable the
 #if defined(DEBUG)
 CONFIG_INTEGER(JitStressEvexEncoding, W("JitStressEvexEncoding"), 0) // Enable EVEX encoding for SIMD instructions when
                                                                      // AVX-512VL is available.
-CONFIG_INTEGER(JitForceEVEXEncoding, W("JitForceEVEXEncoding"), 0) // Force EVEX encoding for SIMD instructions.
```
tannergooding (Member Author):

Force isn't needed because we enable all ISAs in AltJit by default, so we can "force" EVEX simply by running the AltJit.

This also ends up being safer, since it doesn't require us to have AVX512-capable machines: AltJit code won't be run by default.

tannergooding (Member Author):

Also noting that if we want to disable EVEX support, it's as simple as changing the following one line:

```diff
- CONFIG_INTEGER(EnableAVX512F,      W("EnableAVX512F"),      1)
+ CONFIG_INTEGER(EnableAVX512F,      W("EnableAVX512F"),      0)
```

This will have AVX512F be off by default but allow users to opt in by setting it to 1. So it's trivial to disable this if necessary due to any potential bugs or edge cases discovered after merge.

```diff
@@ -103,7 +103,7 @@ const char* CodeGen::genInsDisplayName(emitter::instrDesc* id)

     const emitter* emit = GetEmitter();

-    if (emit->IsVexOrEvexEncodedInstruction(ins))
+    if (emit->IsVexOrEvexEncodableInstruction(ins))
```
tannergooding (Member Author):

Encodable is a bit more accurate (than Encoded) since we're doing a "can it be encoded" rather than a "must it be encoded" check.

Comment on lines +208 to 217
```diff
 bool emitter::IsVexOrEvexEncodableInstruction(instruction ins) const
 {
-    return IsVexEncodedInstruction(ins) || IsEvexEncodedInstruction(ins);
+    if (!UseVEXEncoding())
+    {
+        return false;
+    }
+
+    insFlags flags = CodeGenInterface::instInfo[ins];
+    return (flags & (Encoding_VEX | Encoding_EVEX)) != 0;
 }
```
tannergooding (Member Author):

This is a little bit faster than checking them independently. We don't need to check UseEVEXEncoding() since we'll end up filtering by TakesEvexPrefix(id), which in turn duplicates the IsEvexEncodableInstruction check.

We can't check just VEX since we need to also return true for EVEX only instructions.

```diff
         return false;
     }

-    if (!emitComp->DoJitStressEvexEncoding())
+    if (HasHighSIMDReg(id) || (id->idOpSize() == OPSZ64) || HasKMaskRegisterDest(ins))
```
tannergooding (Member Author):

The IsEvexEncodableInstruction check excludes the kmask instructions, since they are VEX only. This simplifies the checks we do here, since it means we can just check for kmask usage in general.

```diff
     }

-    return IsVexEncodedInstruction(ins);
+    return IsVexEncodableInstruction(ins) && (ins != INS_vzeroupper);
```
tannergooding (Member Author):

I think we can probably remove the ins != INS_vzeroupper check, but it might require a couple additional tweaks to get working, so I'm leaving that to a future PR.

```diff
 {
-    if (TakesEvexPrefix(id))
+    if (TakesEvexPrefix(id) && codeEvexMigrationCheck(code)) // TODO-XArch-AVX512: Remove codeEvexMigrationCheck().
```
tannergooding (Member Author):

We might actually be able to remove the codeEvexMigrationCheck(code) at this point, but I'm likewise leaving that to a future PR.

Comment on lines +6019 to +6022
```diff
+            if (JitConfig.EnableAVX512F() != 0)
+            {
+                instructionSetFlags.AddInstructionSet(InstructionSet_AVX512F);
+            }
```
tannergooding (Member Author):

This and the similar blocks below are what allow AVX512 to light up in the AltJit.

Comment on lines +2284 to +2288
```diff
+    // x86-64-v4 feature level supports AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
+    // These have been shipped together historically and at the time of this writing
+    // there exists no hardware which doesn't support the entire feature set. To simplify
+    // the overall JIT implementation, we currently require the entire set of ISAs to be
+    // supported and disable AVX512 support otherwise.
```
tannergooding (Member Author):

Like the comment says, we want to support AVX512 only if all of AVX512F, AVX512BW, AVX512CD, AVX512DQ, and AVX512VL are supported.

These ISAs form the x86-64-v4 baseline, and no hardware has ever shipped with only a subset of them.

Some examples of things we'd have to consider are that legacy-encoded xorps is SSE and VEX-encoded vxorps is AVX. However, EVEX-encoded xorps is AVX512DQ. Likewise the EVEX support for XMM/YMM based xorps is then AVX512DQ + AVX512VL (AVX512DQ_VL).

Supporting this "properly" would require us to add some fairly complex checks to import and likely LSRA to handle the difference and ensure that some suitable fallback is generated. We could write all the logic to support them, but without such hardware existing that work would be "needless" overhead and negatively impact throughput. So it's much easier to just write the JIT to disable AVX512 entirely if any of the "core" ISAs are unsupported.
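
To make the shape of this concrete, the gate is roughly as follows (a paraphrased sketch based on the snippet quoted later in this thread; the exact flag set and surrounding code in the product differ):

```cpp
// Sketch: AVX512 is all-or-nothing. Either the whole x86-64-v4 AVX512
// baseline is usable, or the JIT acts as if none of it exists.
const bool hasFullAvx512Baseline =
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512F) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512BW_VL) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512CD_VL) &&
    instructionSetFlags.HasInstructionSet(InstructionSet_AVX512DQ_VL);

if (!hasFullAvx512Baseline)
{
    // Treat every AVX512 ISA as unsupported rather than generating
    // per-instruction fallbacks for partial feature sets.
}
```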

@tannergooding (Member Author)

@dotnet/jit-contrib It is also worth noting, in addition to the analysis given above, that the TP "regression" is only for hardware with AVX-512 support (which all of our CI machines currently have).

The TP for hardware without AVX-512 will be effectively what it was prior to this PR. -- That is, for such hardware this PR isn't really changing the code we execute, as all the paths already existed and were just being filtered on whether or not the JIT said it supported AVX-512.

For AVX-512 enabled hardware, with this PR we're implicitly using AVX-512 already (as evidenced by the -40k diff to disassembly). This applies to scalar floating-point code as well as 128-bit and 256-bit SIMD code. The codegen will only improve further as we add more AVX-512 support and relevant optimizations, and further still as we add explicit optimizations to the libraries.

For context save/restore, Windows already handles this efficiently via its own APIs. The GC has had AVX-512 support for a couple of years now, so native code is already paying any "additional cost". For Unix, there are some things we could improve, but that entails more in-depth changes to use the native mechanics rather than trying to abstract things over the general Win32 API surface area.

@BruceForstall (Member)

The diffs show especially significant TP regressions for MinOpts -- as much as +4.26% on the MinOpts parts of libraries.pmi.windows.x64.checked.mch. I wouldn't think that would be explained by the LSRA effect of availableRegCount increasing, because I wouldn't expect LSRA to be so involved in MinOpts.

Can you disable the AVX-512 code using DOTNET_EnableAVX512F=0? If so, you could run superpmi.py tpdiff -diff_jit_option EnableAVX512F=0 with your change and see if the MinOpts TP regression (and the other TP regressions) disappear.

@tannergooding (Member Author)

> Can you disable the AVX-512 code using DOTNET_EnableAVX512F=0?

Yes, and this is functionally what happens implicitly for hardware without AVX512 support.

> If so, you could run superpmi.py tpdiff -diff_jit_option EnableAVX512F=0 with your change and see if the MinOpts TP regression (and the other TP regressions) disappear.

This doesn't "quite" work, as DOTNET_EnableAVX512F is a VM option. There is a separate and identical knob in the JIT, but it's only looked up as part of AltJit and only to enable ISAs that might otherwise be unsupported (this allows people without hardware to still test and get disasm for newer ISAs).

In order to get the below numbers I temporarily changed the check to the following so SPMI would work:

```diff
     if (instructionSetFlags.HasInstructionSet(InstructionSet_AVX512BW_VL) &&
         instructionSetFlags.HasInstructionSet(InstructionSet_AVX512CD_VL) &&
         instructionSetFlags.HasInstructionSet(InstructionSet_AVX512DQ_VL) &&
+        JitConfig.EnableAVX512F())
```

I can submit a follow-up/separate PR to enable AltJit to also disable ISAs that are supported by the hardware and to ensure the JIT-side knobs work by default with SPMI.

The remaining perf difference comes from the newer compiler (~0.03-0.05%) and various changes in emitxarch. For example, changing from UseEvexEncoding() && IsEvexEncodedInstruction(ins) to IsVexOrEvexEncodableInstruction(ins). This ultimately removed a lot of code, but without PGO the nested UseVEXEncoding() check doesn't get outlined.

DOTNET_EnableAVX512F=1 (this PR, AVX512 enabled hardware)

Warning: Different compilers used for base and diff JITs. Results may be misleading.
Base JIT's compiler: MSVC 193532215
Diff JIT's compiler: MSVC 193632502

Overall (+0.25% to +2.01%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +1.70% |
| coreclr_tests.run.windows.x64.checked.mch | +2.01% |
| libraries.crossgen2.windows.x64.checked.mch | +0.25% |
| libraries.pmi.windows.x64.checked.mch | +1.56% |
| libraries_tests.pmi.windows.x64.checked.mch | +1.65% |

MinOpts (+0.77% to +4.25%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +3.01% |
| coreclr_tests.run.windows.x64.checked.mch | +3.00% |
| libraries.crossgen2.windows.x64.checked.mch | +0.77% |
| libraries.pmi.windows.x64.checked.mch | +4.25% |
| libraries_tests.pmi.windows.x64.checked.mch | +3.33% |

FullOpts (+0.25% to +1.63%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +1.42% |
| coreclr_tests.run.windows.x64.checked.mch | +1.18% |
| libraries.crossgen2.windows.x64.checked.mch | +0.25% |
| libraries.pmi.windows.x64.checked.mch | +1.54% |
| libraries_tests.pmi.windows.x64.checked.mch | +1.63% |
Details

All contexts:

| Collection | Base # instructions | Diff # instructions | PDIFF |
| --- | --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 104,919,715,049 | 106,701,897,153 | +1.70% |
| coreclr_tests.run.windows.x64.checked.mch | 788,527,599,616 | 804,356,245,532 | +2.01% |
| libraries.crossgen2.windows.x64.checked.mch | 123,711,122,727 | 124,023,390,612 | +0.25% |
| libraries.pmi.windows.x64.checked.mch | 230,402,159,674 | 233,987,828,827 | +1.56% |
| libraries_tests.pmi.windows.x64.checked.mch | 502,104,878,759 | 510,383,132,202 | +1.65% |

MinOpts contexts:

| Collection | Base # instructions | Diff # instructions | PDIFF |
| --- | --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 18,423,693,189 | 18,978,677,027 | +3.01% |
| coreclr_tests.run.windows.x64.checked.mch | 359,918,735,082 | 370,699,709,109 | +3.00% |
| libraries.crossgen2.windows.x64.checked.mch | 1,712,794 | 1,725,952 | +0.77% |
| libraries.pmi.windows.x64.checked.mch | 1,319,088,937 | 1,375,098,102 | +4.25% |
| libraries_tests.pmi.windows.x64.checked.mch | 6,105,990,635 | 6,309,223,185 | +3.33% |

FullOpts contexts:

| Collection | Base # instructions | Diff # instructions | PDIFF |
| --- | --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 86,496,021,860 | 87,723,220,126 | +1.42% |
| coreclr_tests.run.windows.x64.checked.mch | 428,608,864,534 | 433,656,536,423 | +1.18% |
| libraries.crossgen2.windows.x64.checked.mch | 123,709,409,933 | 124,021,664,660 | +0.25% |
| libraries.pmi.windows.x64.checked.mch | 229,083,070,737 | 232,612,730,725 | +1.54% |
| libraries_tests.pmi.windows.x64.checked.mch | 495,998,888,124 | 504,073,909,017 | +1.63% |

DOTNET_EnableAVX512F=0 (this PR, non-AVX512 enabled hardware)

Warning: Different compilers used for base and diff JITs. Results may be misleading.
Base JIT's compiler: MSVC 193532215
Diff JIT's compiler: MSVC 193632502

Overall (+0.21% to +0.41%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.29% |
| coreclr_tests.run.windows.x64.checked.mch | +0.41% |
| libraries.crossgen2.windows.x64.checked.mch | +0.25% |
| libraries.pmi.windows.x64.checked.mch | +0.21% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.21% |

MinOpts (+0.52% to +0.79%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.79% |
| coreclr_tests.run.windows.x64.checked.mch | +0.68% |
| libraries.crossgen2.windows.x64.checked.mch | +0.76% |
| libraries.pmi.windows.x64.checked.mch | +0.66% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.52% |

FullOpts (+0.18% to +0.25%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.19% |
| coreclr_tests.run.windows.x64.checked.mch | +0.18% |
| libraries.crossgen2.windows.x64.checked.mch | +0.25% |
| libraries.pmi.windows.x64.checked.mch | +0.21% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.21% |
Details

All contexts:

| Collection | Base # instructions | Diff # instructions | PDIFF |
| --- | --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 105,445,894,170 | 105,754,830,649 | +0.29% |
| coreclr_tests.run.windows.x64.checked.mch | 813,671,944,975 | 816,980,627,429 | +0.41% |
| libraries.crossgen2.windows.x64.checked.mch | 123,710,946,415 | 124,017,242,367 | +0.25% |
| libraries.pmi.windows.x64.checked.mch | 231,164,671,088 | 231,657,914,986 | +0.21% |
| libraries_tests.pmi.windows.x64.checked.mch | 503,130,079,375 | 504,205,511,144 | +0.21% |

MinOpts contexts:

| Collection | Base # instructions | Diff # instructions | PDIFF |
| --- | --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 18,462,321,928 | 18,607,792,875 | +0.79% |
| coreclr_tests.run.windows.x64.checked.mch | 366,939,039,379 | 369,425,191,639 | +0.68% |
| libraries.crossgen2.windows.x64.checked.mch | 1,712,799 | 1,725,868 | +0.76% |
| libraries.pmi.windows.x64.checked.mch | 1,319,381,990 | 1,328,038,412 | +0.66% |
| libraries_tests.pmi.windows.x64.checked.mch | 6,106,527,083 | 6,138,104,650 | +0.52% |

FullOpts contexts:

| Collection | Base # instructions | Diff # instructions | PDIFF |
| --- | --- | --- | --- |
| aspnet.run.windows.x64.checked.mch | 86,983,572,242 | 87,147,037,774 | +0.19% |
| coreclr_tests.run.windows.x64.checked.mch | 446,732,905,596 | 447,555,435,790 | +0.18% |
| libraries.crossgen2.windows.x64.checked.mch | 123,709,233,616 | 124,015,516,499 | +0.25% |
| libraries.pmi.windows.x64.checked.mch | 229,845,289,098 | 230,329,876,574 | +0.21% |
| libraries_tests.pmi.windows.x64.checked.mch | 497,023,552,292 | 498,067,406,494 | +0.21% |

> I wouldn't think that would be explained by the LSRA effect of availableRegCount increasing, because I wouldn't expect LSRA to be so involved in MinOpts.

I believe you're underestimating the impact of availableRegCount increasing on the retired instruction count. We're going from 32 registers (16 integer + 16 floating) to 56 registers (16 integer + 32 floating + 8 mask), i.e. 75% more registers.

This then tracks with the regression on x86 being much smaller (up to 1.5% rather than up to 4.26%), since it is going from 16 registers (8 integer + 8 floating) to 24 registers (8 integer + 8 floating + 8 mask), i.e. 50% more registers.

Of the 8 places in LSRA that do for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = REG_NEXT(reg)), 2 of them will always execute for MinOpts (the loop shape is sketched after this list):

  • allocateRegisters - called once per compilation in doLinearScan
  • processBlockStartLocations - called once per block per compilation in resolveRegisters
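
For illustration, the loop pattern in question looks roughly like the following (a paraphrased sketch, not the verbatim LSRA code; the body stands in for whatever per-register bookkeeping the real loops do):

```cpp
// Every walk over the register file scales with AVAILABLE_REG_COUNT, which
// grows from 32 to 56 on x64 once the 32 SIMD registers and 8 mask
// registers are exposed.
for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = REG_NEXT(reg))
{
    RegRecord* physRegRecord = getRegisterRecord(reg); // per-register state
    // ... reset or propagate the assignment state for this register ...
}
```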

However, as the profile shows, these aren't "hot paths" for the code. So while we are executing more instructions, they don't actually cause any real change to wall clock time. You have to factor in not only retired instructions but also cycles not in halt, which together are used to measure cycles per instruction (CPI). You'll notice that LinearScan::allocateReg has a CPI of 0.59, so it's able to dispatch almost 2 instructions per cycle, reliably. However, impImportBlockCode has a CPI of 0.97, so it's only able to dispatch about 1 instruction per cycle. This makes impImportBlockCode almost as expensive as allocateRegisters despite executing significantly fewer instructions, and therefore having a much smaller impact on the TP metric that SPMI is collecting.
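
To make that arithmetic concrete, here is an illustrative sketch (the CPI values are the ones quoted above; the instruction counts are made-up placeholders, not measurements):

```cpp
#include <cstdio>

int main()
{
    // Wall-clock cost is roughly: cycles = retired instructions * CPI.
    const double allocCPI  = 0.59; // LinearScan::allocateReg (~1.7 insns/cycle)
    const double importCPI = 0.97; // impImportBlockCode (~1 insn/cycle)

    const double allocInsns  = 100.0; // placeholder instruction count
    const double importInsns = 63.0;  // placeholder: ~37% fewer instructions

    // Nearly identical cycle cost despite very different instruction counts,
    // which is why SPMI's retired-instruction metric can mislead.
    std::printf("alloc cycles:  %.1f\n", allocInsns * allocCPI);   // 59.0
    std::printf("import cycles: %.1f\n", importInsns * importCPI); // 61.1
    return 0;
}
```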

As a corollary example, see #83479. This was an extremely simple change to a for loop in the LinearScan constructor. This loop was called once per compilation and was simply doing some trivial if checks for the 23 TYP_* kinds. Linearizing this loop, which effectively just removed an average of 6 compares/branches per type, accounted for a -0.28% change to TP. This had no real change to crossgen throughput: Crossgen2 Throughput - Single - System.Private.Corelib -or- Crossgen2 Throughput - Single Threaded - Single - System.Private.Corelib.

None of the increases to CG2 wall clock time have been due to the AVX512 support added so far. I don't expect that to change with this PR either. However, if it does we have some identified real hot spots from the profiles above that show us where we can most easily win that back.

@jakobbotsch (Member)

> As a corollary example, see #83479. This was an extremely simple change to a for loop in the LinearScan constructor. This loop was called once per compilation and was simply doing some trivial if checks for the 23 TYP_* kinds. Linearizing this loop, which effectively just removed an average of 6 compares/branches per type, accounted for a -0.28% change to TP. This had no real change to crossgen throughput: Crossgen2 Throughput - Single - System.Private.Corelib -or- Crossgen2 Throughput - Single Threaded - Single - System.Private.Corelib.

The -0.28% was in libraries.crossgen2 MinOpts. There are 15 of those contexts, so it is not really a representative sample (indeed, we wouldn't expect many prejitted MinOpts contexts). The overall impact of that change on crossgen2 collections was much smaller, and certainly within the variance of crossgen2 throughput (which is large -- several percent).

I agree that actual profilers can give better data for the "macro" changes, but I think doing so on an SPMI run would have much less variance. Have we spent time with @SingleAccretion's PIN tool on this change yet? Can you post the detailed breakdown from that?

@tannergooding (Member Author)

> Have we spent time with @SingleAccretion's PIN tool on this change yet? Can you post the detailed breakdown from that?

The pin tool uses Intel PIN, which is exactly what VTune uses under the covers. AMD uProf uses a similar set of model-specific registers for tracking precise instruction counts as well. It is not going to give any additional information beyond what the hardware-specific profilers are reporting, which I already shared above.

If someone else wants to spend the extra cycles to build a local copy of the tool, collect some metrics, etc., then please feel free. I have several other work items I need to focus on and have already given significant data showing that this is likely not going to be problematic in practice. On the off chance that it is measurably problematic (which will be quickly caught in the weekly perf triage), we have a one-line change that will allow us to disable AVX512 support and investigate more in depth at that time.

@EgorBo (Member) commented Mar 24, 2023

> indeed, we wouldn't expect many prejitted MinOpts contexts

Surprised to see any MinOpts contexts in crossgen at all; I guess those are explicit MethodImpl(NoOptimization)? Because even cctors are compiled with opts in crossgen (unlike in the jit).

@tannergooding (Member Author)

The general point of the example was that we have made a plethora of changes in the past few months that have impacted the SPMI-tracked TP metric, both in terms of "regressions" and in terms of "improvements", for both MinOpts and optimized code.

Despite this, and despite many of them being "significant" changes to the TP percentile, there has been no "real" change to the TP performance of crossgen2.

The only real regression was caused by the introduction of a non-trivial amount of new managed code, and the only real improvement came from code simplification in part of the JIT's front end: a PR with a measured TP impact of only -0.16% that nonetheless produced a 4.3% actual improvement to perf, because it was done on an actual hot path.

Instructions can range from 0.2 cycles with 5x dispatch all the way up to 40+ cycles for a 64-bit division. Some special instructions can even take 140 cycles or, in one of the worst cases, 1400 cycles. This means the SPMI "retired instructions" number is an overall poor measurement of real-world impact, and while it is fine to look at as a sort of basic heuristic for potential impact, we should in general be looking at real perf numbers to make the final decision.

@EgorBo (Member) commented Mar 24, 2023

> there has been no "real" change to the TP performance of crossgen2.

Our crossgen2 benchmarks don't measure TP for tier0, do they? The problem the team was worried about is that Tier0 regressions are way more important than Tier1/CG/NAOT, because every Tier0 compilation by definition means that the application's execution is stopped in a stub, waiting for the JIT to finish. So the more time we spend jitting at Tier0, the slower the startup. It's way less important for non-MinOpts, where we promote methods asynchronously. Although, I agree with you that the PIN tool doesn't directly map into actual overhead. @kunalspathak, can we kick off a TE run from this PR via the bot to see the "startup time / time to first response" change? (Ideally with ReadyToRun=0 if possible.)

@tannergooding (Member Author) commented Mar 24, 2023

> Our crossgen2 benchmarks don't measure TP for tier0, do they?

They do not by default, but many of the tracked TP improvements/regressions have applied equally or even more strongly to optimized code than to MinOpts. I've also explicitly measured crossgen of a debug corelib using the release JIT, which does use MinOpts for everything.

We have extremely good CPI in LSRA, and while it is overall one of the hotter pieces of code in the JIT today (in terms of total actual cycles spent executing), an increase in instructions there is much less consequential than changes to impImportBlockCode or fgFindJumpTargets, where we have very poor CPI. LSRA having such good CPI also means it is much harder to get real-world perf improvements out of it, but much easier to cause an SPMI TP regression with it.

It also means that improving something like fgFindJumpTargets is a much better overall use of time. -- We spend 16% of the 46.5 million cycles spent executing fgFindJumpTargets in the jump table logic that is generated for switch (opcode).

We similarly spend a significant amount of time in tree->OperRequiresAsgFlag for fgMorphSmpOp, in register saving in gtSetEvalOrder due to the size of the method, and in impImportBlockCode in general due to it being so large that the generated assembly isn't very good.

All of these are relatively low hanging fruit that would provide real measurable improvement to the JIT and are a significantly better use of our time.

@kunalspathak (Member)

/benchmark json aspnet-citrine-win runtime

@pr-benchmarks bot commented Mar 24, 2023

Benchmark started for json on aspnet-citrine-win with runtime. Logs: link

@kunalspathak (Member)

> (ideally, with ReadyToRun=0 if possible)

I don't think it is possible directly; it might be doable using --arguments, but I haven't explored that.

@kunalspathak (Member)

Not sure why the results are not automatically posted, but here is what I get from link.

| application | json.base | json.pr | |
| --- | --- | --- | --- |
| CPU Usage (%) | 76 | 81 | +6.58% |
| Cores usage (%) | 2,129 | 2,267 | +6.48% |
| Working Set (MB) | 78 | 78 | 0.00% |
| Private Memory (MB) | 102 | 101 | -0.98% |
| Build Time (ms) | 3,441 | 3,433 | -0.23% |
| Start Time (ms) | 321 | 325 | +1.25% |
| Published Size (KB) | 97,720 | 97,720 | 0.00% |
| Symbols Size (KB) | 52 | 52 | 0.00% |
| .NET Core SDK Version | 8.0.100-preview.4.23174.1 | 8.0.100-preview.4.23174.1 | |

| load | json.base | json.pr | |
| --- | --- | --- | --- |
| CPU Usage (%) | 75 | 75 | 0.00% |
| Cores usage (%) | 2,097 | 2,104 | +0.33% |
| Working Set (MB) | 48 | 48 | 0.00% |
| Private Memory (MB) | 363 | 363 | 0.00% |
| Start Time (ms) | 0 | 0 | |
| First Request (ms) | 152 | 143 | -5.92% |
| Requests/sec | 1,108,954 | 1,101,094 | -0.71% |
| Requests | 16,744,703 | 16,625,384 | -0.71% |
| Mean latency (ms) | 1.25 | 1.21 | -3.20% |
| Max latency (ms) | 53.34 | 49.47 | -7.26% |
| Bad responses | 0 | 0 | |
| Socket errors | 0 | 0 | |
| Read throughput (MB/s) | 154.41 | 153.31 | -0.71% |
| Latency 50th (ms) | 0.30 | 0.31 | +1.32% |
| Latency 75th (ms) | 0.86 | 0.84 | -2.44% |
| Latency 90th (ms) | 3.38 | 3.32 | -1.78% |
| Latency 99th (ms) | 12.91 | 11.95 | -7.44% |

@tannergooding mentioned this pull request Mar 24, 2023
@EgorBo (Member) commented Mar 24, 2023

Ah, looks like TTFR has StdDev ~7-8% and StartUpTime has ~10-11%, so there is no way we can detect that from a single run 🙁

(screenshot of benchmark variance history omitted)

Linux-x64 is better but still not stable enough

@BruceForstall (Member)

> Ah, looks like TTFR has StdDev ~7-8% and StartUpTime has ~10-11% so there is no way we can detect that from a single run

So I presume, then, that we can't draw any conclusions from the aspnet perf data?

@tannergooding (Member Author)

> So I presume, then, that we can't draw any conclusions from the aspnet perf data?

Not without more runs due to the existing noisiness.

This PR in the results above is actually showing better time to first request (by 9ms) and better overall latency up through P99, but also 2ms slower build time and 4ms slower start time. This is all within the existing standard deviation, so we can at least speculate that there isn't a "substantial" improvement/regression.

@BruceForstall (Member)

I generated a PIN diff (using the https://github.com/SingleAccretion/Dotnet-Runtime.Dev#analyze-pin-trace-diffps1---diff-the-traces-produced-by-the-pin-tool script) with a baseline of this PR with AVX512F disabled and this PR as the diff, using the libraries.pmi.windows.x64.checked.mch collection. These are all of the regressions (there were also per-function improvements), almost all in LSRA, as expected due to the for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = REG_NEXT(reg)) loops.

```
Base: 231689703239, Diff: 234118161133, +1.0482%

?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z                       : 1285213159 : +50.45%  : 35.66% : +0.5547%
?allocateRegisters@LinearScan@@QEAAXXZ                                                : 527538735  : +10.13%  : 14.64% : +0.2277%
?newRefPositionRaw@LinearScan@@AEAAPEAVRefPosition@@IPEAUGenTree@@W4RefType@@@Z       : 252824560  : +13.35%  : 7.02%  : +0.1091%
?resolveRegisters@LinearScan@@QEAAXXZ                                                 : 190399730  : +10.24%  : 5.28%  : +0.0822%
?TakesEvexPrefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                    : 180090394  : +100.67% : 5.00%  : +0.0777%
?addRefsForPhysRegMask@LinearScan@@AEAAX_KIW4RefType@@_N@Z                            : 177658558  : +40.90%  : 4.93%  : +0.0767%
?buildIntervals@LinearScan@@QEAAXXZ                                                   : 112180003  : +8.50%   : 3.11%  : +0.0484%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                     : 93370218   : +7.66%   : 2.59%  : +0.0403%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                          : 91125460   : +1.84%   : 2.53%  : +0.0393%
?allocateReg@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z   : 63015313   : +1.36%   : 1.75%  : +0.0272%
?compExactlyDependsOn@Compiler@@AEBA_NW4CORINFO_InstructionSet@@@Z                    : 33734585   : +81.86%  : 0.94%  : +0.0146%
?compInitOptions@Compiler@@IEAAXPEAVJitFlags@@@Z                                      : 3873920    : +0.91%   : 0.11%  : +0.0017%
```

A PIN per-function diff between this PR and the baseline in main shows these regressions (once again, I've omitted the improved functions). The results are similar to above, except a number of emitter functions show up in the list:

```
Base: 231257791689, Diff: 234112656395, +1.2345%

?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z                        : 1285286530 : +50.45%  : 28.84% : +0.5558%
?allocateRegisters@LinearScan@@QEAAXXZ                                                 : 527183304  : +10.14%  : 11.83% : +0.2280%
?TakesEvexPrefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                     : 360761464  : NA       : 8.09%  : +0.1560%
?newRefPositionRaw@LinearScan@@AEAAPEAVRefPosition@@IPEAUGenTree@@W4RefType@@@Z        : 252587840  : +13.36%  : 5.67%  : +0.1092%
?resolveRegisters@LinearScan@@QEAAXXZ                                                  : 190283162  : +10.25%  : 4.27%  : +0.0823%
?addRefsForPhysRegMask@LinearScan@@AEAAX_KIW4RefType@@_N@Z                             : 177490565  : +40.96%  : 3.98%  : +0.0768%
?buildIntervals@LinearScan@@QEAAXXZ                                                    : 112181570  : +8.50%   : 2.52%  : +0.0485%
?insEncodeReg012@emitter@@QEAAIPEBUinstrDesc@1@W4_regNumber_enum@@W4emitAttr@@PEA_K@Z  : 93457761   : +50.78%  : 2.10%  : +0.0404%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                      : 93282759   : +7.66%   : 2.09%  : +0.0403%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                           : 91032693   : +1.84%   : 2.04%  : +0.0394%
?AddSimdPrefixIfNeeded@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                 : 70216724   : +157.30% : 1.58%  : +0.0304%
?insEncodeReg345@emitter@@QEAAIPEBUinstrDesc@1@W4_regNumber_enum@@W4emitAttr@@PEA_K@Z  : 66055896   : +99.82%  : 1.48%  : +0.0286%
?allocateReg@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z    : 65937789   : +1.42%   : 1.48%  : +0.0285%
?AddSimdPrefixIfNeededAndNotPresent@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z    : 63784904   : +219.43% : 1.43%  : +0.0276%
?emitOutputAM@emitter@@QEAAPEAEPEAEPEAUinstrDesc@1@_KPEAUCnsVal@1@@Z                   : 51543523   : +9.08%   : 1.16%  : +0.0223%
?compExactlyDependsOn@Compiler@@AEBA_NW4CORINFO_InstructionSet@@@Z                     : 33756085   : +81.24%  : 0.76%  : +0.0146%
?emitGetAdjustedSize@emitter@@QEBAIPEAUinstrDesc@1@_K@Z                                : 32275401   : +13.46%  : 0.72%  : +0.0140%
?emitInsSizeAM@emitter@@QEAAIPEAUinstrDesc@1@_K@Z                                      : 14255524   : +4.16%   : 0.32%  : +0.0062%
?emitInsSizeSVCalcDisp@emitter@@QEAAIPEAUinstrDesc@1@_KHH@Z                            : 13690387   : +24.15%  : 0.31%  : +0.0059%
?TakesRexWPrefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                     : 13372542   : +2.68%   : 0.30%  : +0.0058%
?emitOutputSV@emitter@@QEAAPEAEPEAEPEAUinstrDesc@1@_KPEAUCnsVal@1@@Z                   : 9425469    : +4.66%   : 0.21%  : +0.0041%
?compInitOptions@Compiler@@IEAAXPEAVJitFlags@@@Z                                       : 9142373    : +2.17%   : 0.21%  : +0.0040%
?emitInsSize@emitter@@QEAAIPEAUinstrDesc@1@_K_N@Z                                      : 8064798    : +6.26%   : 0.18%  : +0.0035%
?AddRexWPrefix@emitter@@QEAA_KPEBUinstrDesc@1@_K@Z                                     : 7042464    : +9.57%   : 0.16%  : +0.0030%
??0LinearScan@@QEAA@PEAVCompiler@@@Z                                                   : 6152156    : +4.16%   : 0.14%  : +0.0027%
```

@BruceForstall (Member)

I feel like we've analyzed this enough and there is data here that can drive future TP improvement activities.

@tannergooding merged commit c6cc201 into dotnet:main Mar 25, 2023
@tannergooding deleted the evex branch March 25, 2023 00:18
@tannergooding (Member Author)

> The results are similar to above, except a number of emitter functions show up in the list:

Right, this is basically what I covered above:

> For example changing from UseEvexEncoding() && IsEvexEncodedInstruction(ins) to IsVexOrEvexEncodableInstruction(ins). This ultimately removed a lot of code, but without PGO the nested UseVEXEncoding() check doesn't get outlined.

This should get fixed with PGO data, but if it still isn't outlined after the next PGO update, we can refactor it a bit so the core check is inlined and the more complex bit of logic is not.
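
One hypothetical shape for that refactor (a sketch only; IsVexOrEvexEncodableInstructionSlow is a made-up helper name, not something in the codebase):

```cpp
// Keep the cheap, almost-always-decisive flag check inlined at call sites
// and push the table lookup out of line so the hot path stays small.
bool emitter::IsVexOrEvexEncodableInstruction(instruction ins) const
{
    if (!UseVEXEncoding())
    {
        return false;
    }
    return IsVexOrEvexEncodableInstructionSlow(ins); // marked NOINLINE
}
```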

@jakobbotsch (Member) commented Mar 25, 2023

I did some wall clock measurements of ASP.NET tier0 contexts:

  1. I created a .mcl of the tier 0 contexts in the current ASP.NET collection. This is 58624 contexts.
  2. I hacked superpmi to replay every context 10 times (instead of just 1) to amortize the cost of the file handling
  3. I did 10 replays of the .mcl with a base and diff jit, where the diff jit is with this commit, and the base jit is the parent commit.

The results were the following, in milliseconds, taken from SPMI's output:

```
base = {25776.552800, 26043.440300, 25901.721800, 25787.304400,
   26002.092800, 25677.519400, 25641.274600, 25619.626000,
   25898.820300, 26292.012000};
diff = {26312.449900, 26448.550100, 26414.532700, 26370.734300,
   26243.087700, 26335.908700, 26299.542700, 26181.570600,
   26385.097500, 26297.585800};
```

The 95% confidence interval of this is a regression in wall clock time of between 1.2% and 2.4%. This includes time spent in the SPMI part of the JIT-EE interface, so the real regression is a bit higher (though probably not much, IIRC we spend around 10-15% of SPMI replay in SPMI).
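
For reference, the quoted interval can be reproduced from the numbers above with a standard two-sample (Welch) t interval; the sketch below is my own reconstruction, not necessarily the exact method used:

```cpp
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

static double mean(const std::vector<double>& v)
{
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

static double sampleVar(const std::vector<double>& v, double m)
{
    double s = 0;
    for (double x : v) s += (x - m) * (x - m);
    return s / (v.size() - 1); // unbiased sample variance
}

int main()
{
    std::vector<double> base = {25776.5528, 26043.4403, 25901.7218, 25787.3044,
                                26002.0928, 25677.5194, 25641.2746, 25619.6260,
                                25898.8203, 26292.0120};
    std::vector<double> diff = {26312.4499, 26448.5501, 26414.5327, 26370.7343,
                                26243.0877, 26335.9087, 26299.5427, 26181.5706,
                                26385.0975, 26297.5858};

    double mb = mean(base), md = mean(diff);
    double se = std::sqrt(sampleVar(base, mb) / base.size() +
                          sampleVar(diff, md) / diff.size());
    double t  = 2.18; // ~97.5th percentile of Student's t, Welch dof ~ 12

    // Express the CI on the difference of means as a percentage of base.
    std::printf("regression: +%.1f%% .. +%.1f%%\n",
                (md - mb - t * se) / mb * 100.0,  // ~ +1.2
                (md - mb + t * se) / mb * 100.0); // ~ +2.4
    return 0;
}
```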

Just some more data to help prioritize follow-up work. Maybe open an issue to track recouping some of this, with an appropriate milestone?

@BruceForstall added the avx512 label on Mar 27, 2023
@ghost locked as resolved and limited conversation to collaborators on Apr 26, 2023