Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

BruceForstall · 2020-03-11T21:36:47Z

This item tracks the conversion of the System.Numerics.BitOperations class to use arm64 intrinsics.

Related: #33308

kunalspathak · 2020-04-02T09:25:22Z

I was trying to convert BitOperations.PopCount() method to use Arm64 Intrinsics. PopCount instruction “cnt” is only available for vector registers and not for scalar registers. Even for vector registers, the argument can only be a vector holding elements of type byte or sbyte. Since BitOperations.PopCount() APIs takes uint/ulong as argument, here is what we could do:

Move the value of scalar register into a vector register.
Perform cnt on vector register.
Move back the result from vector into scalar register.

However since “cnt” only operates on byte/sbyte, I need to use AddAcross to accumulate the result of popcount across all lanes. I wrote 2 versions of this logic and both are 6-9X slower than the software fallback we have in place today.
The code, corresponding assembly code and result are present at https://gist.github.com/kunalspathak/398a14edae6367033a6656a8a62c6486

Any other suggestion that I can try to optimize PopCount() using arm64 intrinsic or should we stick to software fallback for now?

@TamarChristinaArm @tannergooding @echesakovMSFT

EgorBo · 2020-04-02T09:33:19Z

Hm.. here is what clang emits for __builtin_popcount https://godbolt.org/z/JydBtp

TamarChristinaArm · 2020-04-02T11:38:18Z

@kunalspathak yeah unfortunately we do only have popcount on vector registers, but your sequence is correct and should be much faster than the software fallbacks. It looks like the reason it's not is the surrounding code to create and extract the element.

Your second example Custom2_PopCount seems to be the closest to doing it right but

Vector64<uint> input = Vector64.CreateScalar(value);

should have been converted into a simple fmov

where currently you get

        D2933D80          movz    x0, #0x99ec
        F2B657E0          movk    x0, #0xb2bf LSL #16
        F2CFFEE0          movk    x0, #0x7ff7 LSL #32
        B9400000          ldr     w0, [x0]
        0E040FE0          dup     v0.2s, wzr
        FD000FA0          str     d0, [fp,#24]
        B9001BA0          str     w0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        FD0017A0          str     d0, [fp,#40]
        FD4017A0          ldr     d0, [fp,#40]

That alone looks like it's more expensive than the software fallback. It looks like the JIT doesn't know how to move values between register files if not through memory.

        52800000          mov     w0, #0
        97FFAE89          bl      System.Runtime.Intrinsics.Vector64:GetElement(System.Runtime.Intrinsics.Vector64`1[Byte],int):ubyte
                                                ;; bbWeight=1    PerfScore 18.00

This should have been an fmov as well. If these are fixed your change should be faster than the fallback. I suspect GetElement is also emitting the same move through memory operations.

TamarChristinaArm · 2020-04-02T12:17:31Z

Perhaps instead of

result = (int)Vector64.GetElement<byte>(added, 0);

this works better for the final bits

result = (int)added.ToScalar();

tannergooding · 2020-04-02T15:04:48Z

Vector64 input = Vector64.CreateScalar(value);
should have been converted into a simple fmov

This is tracked by #33496 and should likely be resolved before moving forward on the other items so we don't accidentally skew benchmark numbers due to inefficient create methods.

kunalspathak · 2020-04-02T19:45:39Z

Thanks @TamarChristinaArm for the insights. Yes, CreateScalar is something to be done as part of #33496 but I was not sure how much perf gain it would give. I will work on that API first to see gains for PopCount. Also, didn't notice ToScalar() . The code looks simpler but it doesn't make much perf impact. I will get back to it after converting CreateScalar().

kunalspathak · 2020-04-02T19:46:15Z

Hm.. here is what clang emits for __builtin_popcount https://godbolt.org/z/JydBtp

Thanks for reminding me of godbolt @EgorBo . I keep forgetting that I can verify the code generated by others. :)

BruceForstall added arch-arm64 area-System.Numerics labels Mar 11, 2020

BruceForstall added this to the 5.0 milestone Mar 11, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 11, 2020

BruceForstall assigned kunalspathak Mar 11, 2020

BruceForstall changed the title ~~Optimize System.Numerics.BitOperations~~ Optimize System.Numerics.BitOperations using arm64 intrinsics Mar 11, 2020

tannergooding mentioned this issue Mar 11, 2020

Optimize System.Runtime.Intrinsics using arm64 intrinsics #33496

Closed

BruceForstall mentioned this issue Mar 11, 2020

Optimize library code using arm64 intrinsics #33308

Closed

tannergooding removed the untriaged New issue has not been triaged by the area owner label Mar 11, 2020

This was referenced Apr 3, 2020

Add fmov arm64 intrinsic in JIT to implement Vector*.CreateScalarUnsafe API #34485

Closed

BitOperations arm64 intrinsic for LeadingZeroCount, TrailingZeroCount and Log2 #34486

Merged

kunalspathak mentioned this issue Apr 29, 2020

Optimize BitOperations.PopCount() with arm64 intrinsics #35636

Merged

kunalspathak closed this as completed in #35636 May 8, 2020

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

BruceForstall commented Mar 11, 2020

kunalspathak commented Apr 2, 2020

EgorBo commented Apr 2, 2020 •

edited

Loading

TamarChristinaArm commented Apr 2, 2020

TamarChristinaArm commented Apr 2, 2020

tannergooding commented Apr 2, 2020

kunalspathak commented Apr 2, 2020

kunalspathak commented Apr 2, 2020

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

Comments

BruceForstall commented Mar 11, 2020

kunalspathak commented Apr 2, 2020

EgorBo commented Apr 2, 2020 • edited Loading

TamarChristinaArm commented Apr 2, 2020

TamarChristinaArm commented Apr 2, 2020

tannergooding commented Apr 2, 2020

kunalspathak commented Apr 2, 2020

kunalspathak commented Apr 2, 2020

EgorBo commented Apr 2, 2020 •

edited

Loading