Disable LZCNT in crossgen #35598

Merged
merged 1 commit into from
Apr 30, 2020

saucecontrol
Member

As a followup to #34550 (comment), this is removing LZCNT from the list of x86 ISAs whitelisted for crossgen on CoreLib.

Now that BitOperations.LeadingZeroCount has a fallback to BSR, the cost of querying Lzcnt.IsSupported before using LZCNT is higher than simply using the fallback. This is a situation very specific to LZCNT because it's most often a one-and-done instruction and the fallback is only very marginally slower.

R2RDump diff summary for System.Private.CoreLib.dll before and after this change:

Left file:  C:\corerun-before\System.Private.CoreLib.dll (10525696 B)
Right file: C:\corerun-after\System.Private.CoreLib.dll (10523136 B)

 LEFT_SIZE RIGHT_SIZE       DIFF  R2R methods (19033 ELEMENTS)
--------------------------------------------------------------
       206        161        -45  Void System.Array+SorterObjectArray.IntrospectiveSort(Int32, Int32)
       206        161        -45  Void System.Array+SorterGenericArray.IntrospectiveSort(Int32, Int32)
      1521       1499        -22  Int32 System.Decimal+DecCalc.ScaleResult(System.Decimal+DecCalc+Buf24*, UInt32, Int32)
      1817       1781        -36  Void System.Decimal+DecCalc.VarDecDiv(ref System.Decimal+DecCalc, ref System.Decimal+DecCalc)
       783        765        -18  Void System.Decimal+DecCalc.VarDecModFull(ref System.Decimal+DecCalc, ref System.Decimal+DecCalc, Int32)
       587        561        -26  Void System.Number.Dragon4Double(Double, Int32, Boolean, ref System.Number+NumberBuffer)
       554        530        -24  Void System.Number.Dragon4Single(Single, Int32, Boolean, ref System.Number+NumberBuffer)
      3011       2975        -36  UInt32 System.Number.Dragon4(UInt64, Int32, UInt32, Boolean, Int32, Boolean, System.Span`1<Byte>, ref Int32)
       224        205        -19  String System.Number.Int32ToHexStr(Int32, Char, Int32)
       238        226        -12  Boolean System.Number.TryInt32ToHexStr(Int32, Char, Int32, System.Span`1<Char>, ref Int32)
       298        279        -19  String System.Number.Int64ToHexStr(Int64, Char, Int32)
       331        308        -23  Boolean System.Number.TryInt64ToHexStr(Int64, Char, Int32, System.Span`1<Char>, ref Int32)
       566        528        -38  UInt64 System.Number.AssembleFloatingPointBits(ref System.Number+FloatingPointInfo, UInt64, Int32, Boolean)
       757        738        -19  UInt64 System.Number.NumberToFloatingPointBitsSlow(ref System.Number+NumberBuffer, ref System.Number+FloatingPointInfo, UInt32, UInt32, UInt32)
        53         23        -30  UInt32 System.Number+BigInteger.CountSignificantBits(UInt32)
        57         25        -32  UInt32 System.Number+BigInteger.CountSignificantBits(UInt64)
        80         46        -34  UInt32 System.Number+BigInteger.CountSignificantBits(ref System.Number+BigInteger)
      1258       1228        -30  Void System.Number+BigInteger.DivRem(ref System.Number+BigInteger, ref System.Number+BigInteger, ref System.Number+BigInteger, ref System.Number+BigInteger)
       144        125        -19  System.Number+DiyFp System.Number+DiyFp.Normalize()
        60         28        -32  Int32 System.SpanHelpers.LocateLastFoundByte(UInt64)
        60         28        -32  Int32 System.SpanHelpers.LocateLastFoundChar(UInt64)
        56         17        -39  Int32 System.Numerics.BitOperations.LeadingZeroCount(UInt32)
        60         19        -41  Int32 System.Numerics.BitOperations.LeadingZeroCount(UInt64)
        53         11        -42  Int32 System.Numerics.BitOperations.Log2(UInt32)
        57         13        -44  Int32 System.Numerics.BitOperations.Log2(UInt64)
       108         90        -18  Int32 System.Buffers.Utilities.SelectBucketIndex(Int32)
        64         32        -32  Int32 System.Buffers.Text.FormattingHelpers.CountHexDigits(UInt64)
       240        181        -59  Boolean System.Buffers.Text.Utf8Formatter.TryFormatUInt64X(UInt64, Byte, Boolean, System.Span`1<Byte>, ref Int32)
   2623915    2623049       -866  <TOTAL>
R2RDump disasm for `System.SpanHelpers.LocateLastFoundByte(UInt64)`
BEFORE
=============================================================

Int32 System.SpanHelpers.LocateLastFoundByte(UInt64)
Id: 4876
StartAddress: 0x004BE290
Size: 60 bytes
UnwindRVA: 0x0086F9BC
Version:            1
Flags:              0x03 EHANDLER UHANDLER
SizeOfProlog:       0x0005
CountOfUnwindCodes: 2
FrameRegister:      RAX
FrameOffset:        0x0
PersonalityRVA:     0x2973D
UnwindCode[0]: CodeOffset 0x0005 FrameOffset 0x3205 NextOffset 0x-1 Op 32
UnwindCode[1]: CodeOffset 0x0001 FrameOffset 0x6001 NextOffset 0x-1 Op RSI(6)

Debug Info
    Bounds:
    Native Offset: 0x0, Prolog, Source Types: StackEmpty
    Native Offset: 0x8, IL Offset: 0x0000, Source Types: StackEmpty
    Native Offset: 0x2E, NoMapping, Source Types: StackEmpty
    Native Offset: 0x36, Epilog, Source Types: StackEmpty

    Variable Locations:
    Variable Number: 0
    Start Offset: 0x0
    End Offset: 0x8
    Loc Type: VLT_REG
    Register: RCX

    Variable Number: 0
    Start Offset: 0x8
    End Offset: 0x12
    Loc Type: VLT_REG
    Register: RSI

    Variable Number: 0
    Start Offset: 0x1B
    End Offset: 0x20
    Loc Type: VLT_REG
    Register: RSI

  4be290: 56                    push rsi
                                UWOP_PUSH_NONVOL RSI(6)
  4be291: 48 83 ec 20           sub rsp, 32
                                UWOP_ALLOC_SMALL 32
  4be295: 48 8b f1              mov rsi, rcx
  4be298: ff 15 0a 28 b5 ff     call qword ptr [0x10aa8]      // Boolean System.Runtime.Intrinsics.X86.Lzcnt+X64.get_IsSupported() (METHOD_ENTRY_DEF_TOKEN)
  4be29e: 84 c0                 test al, al
  4be2a0: 74 09                 je 0x4BE2AB
  4be2a2: 33 c0                 xor eax, eax
  4be2a4: f3 48 0f bd c6        lzcnt rax, rsi
  4be2a9: eb 13                 jmp 0x4BE2BE
  4be2ab: 48 85 f6              test rsi, rsi
  4be2ae: 74 09                 je 0x4BE2B9
  4be2b0: 48 0f bd c6           bsr rax, rsi
  4be2b4: 83 f0 3f              xor eax, 63
  4be2b7: eb 05                 jmp 0x4BE2BE
  4be2b9: b8 40 00 00 00        mov eax, 64
  4be2be: c1 f8 03              sar eax, 3
  4be2c1: f7 d8                 neg eax
  4be2c3: 83 c0 07              add eax, 7
  4be2c6: 48 83 c4 20           add rsp, 32
  4be2ca: 5e                    pop rsi
  4be2cb: c3                    ret

=============================================================

AFTER
=============================================================

Int32 System.SpanHelpers.LocateLastFoundByte(UInt64)
Id: 4876
StartAddress: 0x004BE0B0
Size: 28 bytes
UnwindRVA: 0x00868D54
Version:            1
Flags:              0x03 EHANDLER UHANDLER
SizeOfProlog:       0x0000
CountOfUnwindCodes: 0
FrameRegister:      RAX
FrameOffset:        0x0
PersonalityRVA:     0x2973D

Debug Info
    Bounds:
    Native Offset: 0x0, Prolog, Source Types: StackEmpty
    Native Offset: 0x0, IL Offset: 0x0000, Source Types: StackEmpty
    Native Offset: 0x13, NoMapping, Source Types: StackEmpty
    Native Offset: 0x1B, Epilog, Source Types: StackEmpty

    Variable Locations:
    Variable Number: 0
    Start Offset: 0x0
    End Offset: 0x1
    Loc Type: VLT_REG
    Register: RCX

    Variable Number: 0
    Start Offset: 0x0
    End Offset: 0x5
    Loc Type: VLT_REG
    Register: RCX

  4be0b0: 48 85 c9              test rcx, rcx
  4be0b3: 74 09                 je 0x4BE0BE
  4be0b5: 48 0f bd c1           bsr rax, rcx
  4be0b9: 83 f0 3f              xor eax, 63
  4be0bc: eb 05                 jmp 0x4BE0C3
  4be0be: b8 40 00 00 00        mov eax, 64
  4be0c3: c1 f8 03              sar eax, 3
  4be0c6: f7 d8                 neg eax
  4be0c8: 83 c0 07              add eax, 7
  4be0cb: c3                    ret

=============================================================
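
Read as C, the AFTER listing appears to compute the following. This is only a sketch reconstructed from the disassembly above, not the CoreLib source: `bsr64`, `leading_zero_count64`, and `locate_last_found_byte` are illustrative names, and the loop merely emulates the value the `bsr` instruction would return.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates BSR: index of the highest set bit.
   The real instruction leaves its destination undefined for 0,
   so the sketch treats 0 as a precondition violation. */
static int bsr64(uint64_t x) {
    assert(x != 0);
    int index = 0;
    while (x >>= 1) {
        index++;
    }
    return index;
}

/* test rcx, rcx / je / bsr rax, rcx / xor eax, 63 / mov eax, 64 */
static int leading_zero_count64(uint64_t x) {
    if (x == 0) {
        return 64;            /* zero input handled by the explicit branch */
    }
    return bsr64(x) ^ 63;     /* same as 63 - index for index in 0..63 */
}

/* sar eax, 3 / neg eax / add eax, 7 */
static int locate_last_found_byte(uint64_t match) {
    return 7 - (leading_zero_count64(match) >> 3);
}
```

For example, a match mask with only bit 15 set has 48 leading zeros, so the function returns 7 - (48 >> 3) = 1, the index of the byte that contains that bit.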

Note: Lzcnt.LeadingZeroCount is called only from BitOperations.LeadingZeroCount in CoreLib. All changes shown are from the inlined version of BitOperations.LeadingZeroCount.

cc @tannergooding @davidwrighton

@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this PR. Please help me learn by adding exactly one area label.

@tannergooding
Member

This is also relevant to #33737 which explains that LZCNT on Intel based CPUs is an AVX2/BMI1/BMI2 era instruction and so it isn't from the same general timeframe as the others even if it is encodable without VEX support.

@davidwrighton
Member

For crossgen I believe this change is strictly an improvement. However, I see that you did not make the corresponding change in crossgen2. However, this isn't necessarily the wrong choice. Crossgen2 does not use the IsSupported function call mechanism, but instead it performs speculative compilation assuming the processor intrinsic IS available which yields excellent performance for this case on Haswell and newer processors.

@saucecontrol
Member Author

Thanks, I didn't see that issue. The basic problem is that the crossgen1 support for opportunistic ISA use relies on the cost of the IsSupported check being amortized over the improved alternate HWIntrinsic implementation. With POPCNT and LZCNT, that means it's basically amortized over a single instruction, whereas SSSE3 or SSE4.1 use might be a fully different algorithm with a lot of instructions. POPCNT is a tricky one because we only have a slow software fallback, but LZCNT is a no-brainer now that the fallback is cheap.
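
To make the amortization point concrete, here is a rough C sketch of the two code shapes. Everything in it is a stand-in: `isa_check_fn` models the indirect `IsSupported` call seen in the BEFORE disassembly, `report_supported`/`report_unsupported` are hypothetical check results, and the "hardware" path just reuses the software routine so the example runs on any machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the indirect Lzcnt.X64.IsSupported call. */
typedef bool (*isa_check_fn)(void);

static bool report_supported(void)   { return true; }
static bool report_unsupported(void) { return false; }

/* Software leading-zero count; plays the role of the cheap fallback. */
static int lzc_fallback(uint64_t x) {
    if (x == 0) {
        return 64;
    }
    int n = 0;
    while ((x & 0x8000000000000000ULL) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Stand-in for the single LZCNT instruction (assumption: same result). */
static int lzc_hw_standin(uint64_t x) {
    return lzc_fallback(x);
}

/* Shape with opportunistic ISA use: every call site pays for an
   indirect call, a test, and a branch before doing the real work. */
static int lzc_checked(uint64_t x, isa_check_fn lzcnt_supported) {
    if (lzcnt_supported()) {
        return lzc_hw_standin(x);
    }
    return lzc_fallback(x);
}

/* Shape once the ISA is off the list: only the fallback remains. */
static int lzc_unchecked(uint64_t x) {
    return lzc_fallback(x);
}
```

Both shapes return the same value. The difference is that the checked shape spends a call and a branch to guard one instruction, which is the overhead this comment describes.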

I've also been looking at BitOperations.LeadingZeroCount use across CoreLib, and it turns out most of them are really calculating log2. If those are switched to call BitOperations.Log2 instead, the fallback becomes even cheaper. I'll open a PR for that in a bit.
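
A small C sketch of why the log2 form is cheaper with a BSR-style fallback: computing log2 as `31 ^ lzc(x)` on top of a fallback that itself computes `lzc(x) = 31 ^ bsr(x)` applies the XOR twice, so the result is just `bsr(x)`. The names are illustrative, the loop emulates `bsr`, and setting the low bit so that an input of 0 maps to 0 is an assumption of this sketch rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates BSR: index of the highest set bit (input must be nonzero). */
static int bsr32(uint32_t x) {
    assert(x != 0);
    int index = 0;
    while (x >>= 1) {
        index++;
    }
    return index;
}

/* log2 routed through a leading-zero count with a BSR fallback:
   two XORs that cancel each other out. */
static int log2_via_lzc(uint32_t x) {
    x |= 1;                     /* assumption: map 0 to a result of 0 */
    int lzc = bsr32(x) ^ 31;    /* fallback leading-zero count */
    return 31 ^ lzc;            /* back to the bit index */
}

/* log2 computed directly from the bit index, no XOR needed. */
static int log2_direct(uint32_t x) {
    return bsr32(x | 1);
}
```

The two functions agree for every input. The direct form simply skips the XOR pair and the zero-input branch that the leading-zero-count route needs.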

@saucecontrol
Member Author

> I see that you did not make the corresponding change in crossgen2. However, this isn't necessarily the wrong choice. Crossgen2 does not use the IsSupported function call mechanism, but instead it performs speculative compilation assuming the processor intrinsic IS available which yields excellent performance for this case on Haswell and newer processors.

I haven't dug into the crossgen2 code, but I read your design document, and this was my understanding. Seems to only be an issue with the way it's handled in crossgen1.

@davidwrighton
Member

Agreed. I don't think you should change the crossgen2 behavior. I just wanted to make it clear that I've looked at this, the improvement looks right, but it shouldn't be applied to crossgen2 as it won't really be an improvement there. (I'd have marked this as approved, but I wanted to see a clean set of tests first.)

@tannergooding
Member

> However, this isn't necessarily the wrong choice. Crossgen2 does not use the IsSupported function call mechanism, but instead it performs speculative compilation assuming the processor intrinsic IS available which yields excellent performance for this case on Haswell and newer processors.

Do we have a rough percentage of CPUs that are Haswell or later that would justify whether or not this is actually worthwhile? If you go based off of the March 2020 Steam Hardware Survey: https://store.steampowered.com/hwsurvey/

Then roughly 75% of reported CPUs support AVX2 (and would therefore have BMI1/BMI2/LZCNT, based on what Intel/AMD have shipped).

This naturally doesn't take into account developer machines, server hardware, etc. But, if we are saying LZCNT is likely to be supported, then there is an equal number of AVX2/BMI1/BMI2 machines in the same boat and a greater number of AVX machines and we should likely consider what is a "good baseline".

@davidwrighton
Member

Well, for crossgen1 I believe the approach that @saucecontrol is taking is strictly better than utilizing lzcnt.

For crossgen2 it's a more subtle question, and comes down to how much of a penalty customers will suffer from having to run the JIT vs having the slightly better codegen. If we trust the Steam survey (which we probably shouldn't, but it's somewhat representative of end users running on client hardware who care about performance), we would be trading off slightly better codegen for 75% of the customer base vs 1-2 ms of performance loss on startup for ~25% of the client workload. For the server workload, which is entirely dominated by servers running in the cloud or other high end environments, the numbers are even more supportive of relying on higher specced machines. For instance, while Azure doesn't guarantee that every instance type it supports has Avx2, support is broadly available on most hardware, as the cloud has grown by orders of magnitude since Haswell was the right chip to buy. In particular, all of our very large first party customers which utilize Azure run exclusively on instance types which support Avx2.

@danmoseley
Member

Since 5.0 will mostly be adopted in 2021, presumably the question is really what the availability will have reached by then.

@davidwrighton
Member

Well, for crossgen2, it's really about 2022, as we don't expect it to get widespread use until .NET 6. I feel that for crossgen2 we've already reached the threshold for being willing to just use the lzcnt instruction, but only just barely. In the next 2 years we should be comfortable using an instruction set that will be 8-9 years old at that point.

Member

@tannergooding tannergooding left a comment


I think this is the correct change for Crossgen.

I believe what we currently have for Crossgen2 is sufficient due to how it differs.

@davidwrighton davidwrighton merged commit 6ed9f1b into dotnet:master Apr 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020