Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

Closed
BruceForstall opened this issue Mar 11, 2020 · 7 comments · Fixed by #35636
Closed

Optimize System.Numerics.BitOperations using arm64 intrinsics #33495

BruceForstall opened this issue Mar 11, 2020 · 7 comments · Fixed by #35636

Comments

@BruceForstall
Copy link
Member

This item tracks the conversion of the System.Numerics.BitOperations class to use arm64 intrinsics.

Related: #33308

@BruceForstall BruceForstall added this to the 5.0 milestone Mar 11, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 11, 2020
@BruceForstall BruceForstall changed the title Optimize System.Numerics.BitOperations Optimize System.Numerics.BitOperations using arm64 intrinsics Mar 11, 2020
@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Mar 11, 2020
@kunalspathak
Copy link
Member

I was trying to convert BitOperations.PopCount() method to use Arm64 Intrinsics. PopCount instruction “cnt” is only available for vector registers and not for scalar registers. Even for vector registers, the argument can only be a vector holding elements of type byte or sbyte. Since BitOperations.PopCount() APIs takes uint/ulong as argument, here is what we could do:

  1. Move the value of scalar register into a vector register.
  2. Perform cnt on vector register.
  3. Move back the result from vector into scalar register.

However since “cnt” only operates on byte/sbyte, I need to use AddAcross to accumulate the result of popcount across all lanes. I wrote 2 versions of this logic and both are 6-9X slower than the software fallback we have in place today.
The code, corresponding assembly code and result are present at https://gist.github.com/kunalspathak/398a14edae6367033a6656a8a62c6486

Any other suggestion that I can try to optimize PopCount() using arm64 intrinsic or should we stick to software fallback for now?

@TamarChristinaArm @tannergooding @echesakovMSFT

@EgorBo
Copy link
Member

EgorBo commented Apr 2, 2020

Hm.. here is what clang emits for __builtin_popcount https://godbolt.org/z/JydBtp

@TamarChristinaArm
Copy link
Contributor

@kunalspathak yeah unfortunately we do only have popcount on vector registers, but your sequence is correct and should be much faster than the software fallbacks. It looks like the reason it's not is the surrounding code to create and extract the element.

Your second example Custom2_PopCount seems to be the closest to doing it right but

Vector64<uint> input = Vector64.CreateScalar(value);

should have been converted into a simple fmov

where currently you get

        D2933D80          movz    x0, #0x99ec
        F2B657E0          movk    x0, #0xb2bf LSL #16
        F2CFFEE0          movk    x0, #0x7ff7 LSL #32
        B9400000          ldr     w0, [x0]
        0E040FE0          dup     v0.2s, wzr
        FD000FA0          str     d0, [fp,#24]
        B9001BA0          str     w0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        FD0017A0          str     d0, [fp,#40]
        FD4017A0          ldr     d0, [fp,#40]

That alone looks like it's more expensive than the software fallback. It looks like the JIT doesn't know how to move values between register files if not through memory.

        52800000          mov     w0, #0
        97FFAE89          bl      System.Runtime.Intrinsics.Vector64:GetElement(System.Runtime.Intrinsics.Vector64`1[Byte],int):ubyte
                                                ;; bbWeight=1    PerfScore 18.00

This should have been an fmov as well. If these are fixed your change should be faster than the fallback. I suspect GetElement is also emitting the same move through memory operations.

@TamarChristinaArm
Copy link
Contributor

Perhaps instead of

result = (int)Vector64.GetElement<byte>(added, 0);

this works better for the final bits

result = (int)added.ToScalar();

@tannergooding
Copy link
Member

Vector64 input = Vector64.CreateScalar(value);
should have been converted into a simple fmov

This is tracked by #33496 and should likely be resolved before moving forward on the other items so we don't accidentally skew benchmark numbers due to inefficient create methods.

@kunalspathak
Copy link
Member

Thanks @TamarChristinaArm for the insights. Yes, CreateScalar is something to be done as part of #33496 but I was not sure how much perf gain it would give. I will work on that API first to see gains for PopCount. Also, didn't notice ToScalar() . The code looks simpler but it doesn't make much perf impact. I will get back to it after converting CreateScalar().

@kunalspathak
Copy link
Member

Hm.. here is what clang emits for __builtin_popcount https://godbolt.org/z/JydBtp

Thanks for reminding me of godbolt @EgorBo . I keep forgetting that I can verify the code generated by others. :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants