Add BSF and BSR fallbacks for BitOperations methods #34550

saucecontrol · 2020-04-04T22:12:44Z

Testing the assertion @tannergooding made in #953 (comment) that adding x86 intrinsics is as easy as 1,2,3,4,5,6,7 😄

I wanted to get feedback on the introduction of an X86Base ISA to mirror what we have on ARM, and an X86Base class in S.R.I to house the methods for new instructions. I recall @jkotas mentioning the possibility of exposing some of the x86 base instructions as internal HWIntrinsics that could be used from higher-level methods, although I can't seem to find the discussion now.

I've added HWIntrinsics implementations of BSF and BSR and used them as fallbacks in some of the BitOperations methods that use LZCNT and TZCNT. Since BMI1 requires VEX encoding, it won't be supported on older processors or newer non-VEX processors like the Intel Atom line. The fallbacks will also improve codegen in R2R images, where VEX encoding is disabled.

Benchmark results with BMI1 and LZCNT disabled:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.300-preview-015048
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-XPOYMR : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-ROWHAK : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
LeadingZeroCount_uint	\corerun-bsf-bsr\CoreRun.exe	1,176.0 ns	8.28 ns	7.74 ns	1,176.4 ns	1,161.6 ns	1,188.7 ns	0.30	0.01	-	-	-	-
LeadingZeroCount_uint	\corerun-master\CoreRun.exe	3,867.1 ns	172.27 ns	198.39 ns	3,855.8 ns	3,509.7 ns	4,214.0 ns	1.00	0.00	-	-	-	-

LeadingZeroCount_ulong	\corerun-bsf-bsr\CoreRun.exe	1,156.3 ns	22.38 ns	23.94 ns	1,158.5 ns	1,115.4 ns	1,204.8 ns	0.36	0.01	-	-	-	-
LeadingZeroCount_ulong	\corerun-master\CoreRun.exe	3,185.6 ns	73.96 ns	82.21 ns	3,178.5 ns	3,091.1 ns	3,362.5 ns	1.00	0.00	-	-	-	-

Log2_uint	\corerun-bsf-bsr\CoreRun.exe	883.6 ns	9.93 ns	9.29 ns	880.6 ns	871.2 ns	903.2 ns	0.31	0.01	-	-	-	-
Log2_uint	\corerun-master\CoreRun.exe	2,806.4 ns	44.82 ns	41.93 ns	2,807.6 ns	2,730.5 ns	2,872.6 ns	1.00	0.00	-	-	-	-

Log2_ulong	\corerun-bsf-bsr\CoreRun.exe	875.5 ns	12.38 ns	11.58 ns	876.3 ns	856.2 ns	896.4 ns	0.22	0.00	-	-	-	-
Log2_ulong	\corerun-master\CoreRun.exe	4,058.1 ns	42.23 ns	39.51 ns	4,059.0 ns	4,001.2 ns	4,138.1 ns	1.00	0.00	-	-	-	-

TrailingZeroCount_uint	\corerun-bsf-bsr\CoreRun.exe	885.0 ns	8.74 ns	7.30 ns	885.3 ns	874.4 ns	897.3 ns	0.52	0.08	-	-	-	-
TrailingZeroCount_uint	\corerun-master\CoreRun.exe	1,710.8 ns	206.28 ns	237.55 ns	1,607.4 ns	1,509.7 ns	2,174.5 ns	1.00	0.00	-	-	-	-

TrailingZeroCount_ulong	\corerun-bsf-bsr\CoreRun.exe	1,030.8 ns	19.25 ns	19.76 ns	1,036.3 ns	984.6 ns	1,061.7 ns	0.69	0.02	-	-	-	-
TrailingZeroCount_ulong	\corerun-master\CoreRun.exe	1,484.5 ns	45.58 ns	52.49 ns	1,470.6 ns	1,425.7 ns	1,625.9 ns	1.00	0.00	-	-	-	-

And enabled:

Method	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
LeadingZeroCount_uint	\corerun-bsf-bsr\CoreRun.exe	632.8 ns	8.39 ns	7.85 ns	630.3 ns	624.2 ns	649.7 ns	0.99	0.02	-	-	-	-
LeadingZeroCount_uint	\corerun-master\CoreRun.exe	636.9 ns	8.66 ns	8.10 ns	636.2 ns	623.9 ns	653.7 ns	1.00	0.00	-	-	-	-

LeadingZeroCount_ulong	\corerun-bsf-bsr\CoreRun.exe	634.0 ns	8.36 ns	7.82 ns	632.2 ns	624.7 ns	649.9 ns	1.00	0.01	-	-	-	-
LeadingZeroCount_ulong	\corerun-master\CoreRun.exe	631.2 ns	8.40 ns	7.86 ns	629.3 ns	616.1 ns	643.2 ns	1.00	0.00	-	-	-	-

Log2_uint	\corerun-bsf-bsr\CoreRun.exe	729.5 ns	14.35 ns	12.72 ns	728.4 ns	711.4 ns	751.3 ns	0.70	0.05	-	-	-	-
Log2_uint	\corerun-master\CoreRun.exe	1,021.8 ns	62.09 ns	71.51 ns	1,005.0 ns	926.7 ns	1,175.7 ns	1.00	0.00	-	-	-	-

Log2_ulong	\corerun-bsf-bsr\CoreRun.exe	735.5 ns	7.79 ns	7.29 ns	737.1 ns	720.3 ns	748.9 ns	0.42	0.01	-	-	-	-
Log2_ulong	\corerun-master\CoreRun.exe	1,743.7 ns	33.28 ns	34.18 ns	1,739.0 ns	1,684.3 ns	1,821.5 ns	1.00	0.00	-	-	-	-

TrailingZeroCount_uint	\corerun-bsf-bsr\CoreRun.exe	626.5 ns	9.07 ns	8.48 ns	625.5 ns	614.8 ns	640.9 ns	0.99	0.02	-	-	-	-
TrailingZeroCount_uint	\corerun-master\CoreRun.exe	633.5 ns	8.48 ns	7.93 ns	631.2 ns	622.1 ns	648.7 ns	1.00	0.00	-	-	-	-

TrailingZeroCount_ulong	\corerun-bsf-bsr\CoreRun.exe	636.7 ns	11.54 ns	10.79 ns	632.3 ns	621.0 ns	661.9 ns	1.01	0.02	-	-	-	-
TrailingZeroCount_ulong	\corerun-master\CoreRun.exe	630.5 ns	7.80 ns	7.29 ns	627.5 ns	620.2 ns	648.0 ns	1.00	0.00	-	-	-	-

Note the nice little improvement to Log2 by replacing a subtraction with xor to reverse the index returned by LZCNT

tannergooding · 2020-04-04T23:18:57Z

I probably won't get to reviewing this before Monday, but assigning myself so I don't forget 😄

src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs

tannergooding · 2020-04-07T15:57:21Z

Looks like something isn't quite hooked up right as X86Base.IsSupported is returning false for various tests.

tannergooding · 2020-04-07T15:57:57Z

CC. @CarolEidt and @echesakovMSFT

Also CC. @davidwrighton to help validate the new IsSupported check is hooked up correctly

src/coreclr/src/inc/corinfoinstructionset.h

src/coreclr/src/jit/codegen.h

tannergooding · 2020-04-07T16:14:16Z

src/coreclr/src/jit/emitxarch.cpp

@@ -14818,6 +14819,8 @@ emitter::insExecutionCharacteristics emitter::getInsExecutionCharacteristics(ins
            result.insLatency += PERFSCORE_LATENCY_2C;
            break;

+        case INS_bsf:
+        case INS_bsr:


It would be nice if we had a way to model Intel vs AMD here.

Instruction Intel AMD

BSF 1.00 4.00

BSR 1.00 5.00

POPCNT 1.00 0.50

LZCNT 1.00 0.50

TZCNT 1.00 0.50

For the non-memory encoding, AMD is actually a bit faster still

Numbers taken from https://uops.info/table.html

tannergooding · 2020-04-07T16:16:32Z

src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp

+// Arguments:
+//    node - The hardware intrinsic node
+//
+void CodeGen::genX86BaseIntrinsic(GenTreeHWIntrinsic* node)


Not needed for this PR, but we allowed HW_Category_Scalar intrinsics to be table driven in importation.
It would be nice to do the same here in codegen...

src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs

...ies/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.PlatformNotSupported.cs

saucecontrol · 2020-04-07T17:42:25Z

Looks like something isn't quite hooked up right as X86Base.IsSupported is returning false for various tests.

I saw that ArmBase is explicitly added with the Vector base ISAs, even though it looks like it should already be there from the VM side:

https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2313-L2320

https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/codeman.cpp#L1485-L1487

I couldn't see any cases where X86Base wasn't working, but maybe it needs the same treatment if you're seeing a failure.

tannergooding · 2020-04-07T17:55:17Z

I couldn't see any cases where X86Base wasn't working, but maybe it needs the same treatment if you're seeing a failure.

Seeing failures in CI for x86.

saucecontrol · 2020-04-07T18:04:27Z

Ah, I found one:

System.PlatformNotSupportedException : Operation is not supported on this platform.
      Stack Trace:
        /_/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs(34,0): at System.Runtime.Intrinsics.X86.X86Base.X64.BitScanReverse(UInt64 value)

So it looks like X86Base.X64.IsSupported is returning true when it shouldn't. I'll investigate

src/libraries/System.Utf8String.Experimental/src/System/Runtime/Intrinsics/Intrinsics.Shims.cs

src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs

tannergooding · 2020-04-07T21:22:42Z

We need the instruction sets added here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2205, in the same way as we do for ARM here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2316

It looks like all the other JIT bits exist (comparing to ArmBase and Sse.X64)

saucecontrol · 2020-04-07T21:36:21Z

The way I read it, the explicit addition of ArmBase is redundant because it's passed in from the VM side in the same way the higher-level ISAs are. Since I've added X86Base unconditionally here:

https://github.com/dotnet/runtime/blob/47bc99ba0b45f8cf2144e590013c61ad911865db/src/coreclr/src/vm/codeman.cpp#L1302

It's always available unless removed based on the config knobs, just like SSE+ are.

Am I missing something there?

tannergooding · 2020-04-07T21:43:27Z

The way I read it, the explicit addition of ArmBase is redundant because it's passed in from the VM side in the same way the higher-level ISAs are. Since I've added X86Base unconditionally here:

@davidwrighton, did the base checks get inverted? That is should https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2205 be:

if (!JitConfig.EnableHWIntrinsic())
{
    // Dummy ISAs for simplifying the JIT code
    instructionSetFlags.RemoveInstructionSet(InstructionSet_Vector128);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_Vector256);
}

You are meant to be able to disable the "baseline" as well, so you can test your software fallback.

saucecontrol · 2020-04-07T21:49:28Z

The Vector ISAs don't actually exist, so they're not passed into the JIT. Unless JitConfig.EnableHWIntrinsic is true, they never get added so you get the software fallbacks.

The real ISAs work the opposite way. The VM passes them in if the hardware supports them and the JIT disables them if the config says to.

X86Base is working as expected with that model. It reports IsSupported if COMPlus_EnableHWIntrinsic is on and not if it's not.

saucecontrol · 2020-04-07T22:20:08Z

It's probably a good idea to make ArmBase and X86Base handled consistently in compiler.cpp. I actually found the explicit addition of ArmBase with the dummy ISAs confusing, so I'd be in favor of removing that if it's not actually necessary.

If I've misunderstood any of that, I guess we should add X86Base explicitly to make them match 🤷‍♂

davidwrighton · 2020-04-07T22:32:36Z

@tannergooding @saucecontrol Vector64/128/256 are set by the JIT itself as Tanner noted.

One of the changes I did not realize I made was to make ArmBase set by the VM side. One of the issues is that the VM now attempts to pass a rationalized set of instruction set values to the JIT, and not enabling ArmBase, then requires the VM side to disable all of the other instruction sets. We could tweak that if we needed to but I need to know its necessary before I futz with it.

I would avoid marking the X86Base as being implied by support for SSE, as that means if it is disabled, then SSE will be disabled.

davidwrighton · 2020-04-07T22:34:03Z

And yes, the current model doesn't make sense where we explicitly set the ArmBase instruction set in both the jit and the VM side of the fence.

saucecontrol · 2020-04-07T22:41:03Z

I would avoid marking the X86Base as being implied by support for SSE, as that means if it is disabled, then SSE will be disabled.

There's presently no way to disable X86Base other than to disable HWIntrinsics entirely. Isn't that the same way ArmBase works? Or would this be a case where disabling HWIntrinsics also disables SIMD?

And yes, the current model doesn't make sense where we explicitly set the ArmBase instruction set in both the jit and the VM side of the fence.

Would removing it from the JIT side be ok for now, given the larger changes necessary to separate it on the VM side?

tannergooding · 2020-04-07T22:59:29Z

There's presently no way to disable X86Base other than to disable HWIntrinsics entirely. Isn't that the same way ArmBase works? Or would this be a case where disabling HWIntrinsics also disables SIMD?

That is the expectation.

The helper types and the "base" ISAs don't have their own COMPlus knob, instead they are controlled by the COMPlus_EnableHWIntrinsics knob.
They still participate in the hierarchy and so doing EnableHWIntrinsics=0 will also disable Sse and others (we can't very well use Sse intrinsics if the user has disabled HWIntrinsics in general, after all).

tannergooding · 2020-04-07T23:01:17Z

I would avoid marking the X86Base as being implied by support for SSE, as that means if it is disabled, then SSE will be disabled.

That is exactly what we want. Disabling X86Base requires disabling HWIntrinsics in general, which should also disable SSE (only intrinsics, not general JIT support outside HWIntrinsics of course).

tannergooding · 2020-04-22T19:28:46Z

I logged #35305 to track the emitter's handling of the 'w' and 's' encoding bits.

saucecontrol · 2020-04-24T20:05:41Z

Rebased on master to bring in #35364, and updated to the new intrinsiclist table layout

saucecontrol · 2020-04-25T19:23:40Z

Test failures related to Array.Sort gave me a scare for a second, but those were apparently #35438

CarolEidt · 2020-04-27T18:40:14Z

ping @davidwrighton - could you have a look at the ISA handling here?

davidwrighton

This looks good. I believe the new instruction set should be enabled by the existing pathways in crossgen2, so no additional crossgen2 change should be needed.

saucecontrol · 2020-04-27T21:57:48Z

Rebased and re-ran ThunkGenerator to resolve conflicts with #35504 and #341

tannergooding · 2020-04-29T02:11:36Z

Thanks for the contribution @saucecontrol!

Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 4, 2020

tannergooding self-assigned this Apr 4, 2020

tannergooding added arch-x64 arch-x86 labels Apr 4, 2020

jkotas reviewed Apr 5, 2020

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs Outdated Show resolved Hide resolved