ARM assembly support? #31
Tom Leavy <[email protected]> wrote:
I'm attempting to work on a PoC using mobile devices, wondering if
anyone has attempted to translate the AVX2 code over to something that
is compatible with ARM64 CPUs?
Not yet as far as I know, but an ARM-NEON implementation is certainly on
our TODO list.
Don't have much experience with ASM coding, but maybe I can find some time to try and kick-start the process. I'm assuming a command-by-command port would be fine, unless there are any particular nuances I should know about?
Tom Leavy <[email protected]> wrote:
Don't have much experience with ASM coding but maybe I can find some
time to try and kick start the process.
That would of course be great and highly appreciated!
I'm assuming a command by command port would be fine, unless there are
any particular nuances I should know about?
That won't quite work. The AVX2 implementation uses 256-bit vector
registers, ARM-NEON registers have a length of only 128 bits. Also,
instructions operating on those registers are slightly different, in
particular the multiplication instructions.
Hi @cryptojedi and @tomleavy, my NEON implementation is at https://github.com/cothan/kyber/tree/round3 This is my NEON benchmark on ARMv8, Raspberry Pi 4 8GB, 1.9 GHz (overclocked); it covers just the polynomial multiplication module in Kyber.
And this is the reference code on the same platform.
Good stuff @cothan. Will gladly utilize your version when it's ready.
cothan <[email protected]> wrote:
> Hi @cryptojedi and @tomleavy,

Hi @cothan,

> I have my NEON implementation at https://github.com/cothan/kyber/tree/round3
> The NTT part is sped up by 6x. Overall speed-up is 1.5x. I'm close to finishing the NEON implementation; the critical component is Keccak. I have my NEON code written for Keccak, but my preliminary result for NEON Keccak is only a fractional speed-up.

This sounds great! Would you be willing to send a PR to upstream?

Just a question: the benchmarks below look more like a factor-4 speed-up for the NTT and a factor-3 speed-up for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?
> This is my NEON benchmark on ARMv8, Raspberry Pi 4 8GB, 1.9 GHz. Just the polynomial multiplication module in Kyber.
```
NTT:
median: 72 cycles/ticks
average: 72 cycles/ticks
INVNTT:
median: 115 cycles/ticks
average: 115 cycles/ticks
kyber_keypair:
median: 7162 cycles/ticks
average: 7185 cycles/ticks
kyber_encaps:
median: 8207 cycles/ticks
average: 8247 cycles/ticks
kyber_decaps:
median: 8313 cycles/ticks
average: 8342 cycles/ticks
```
> And this is the reference code on the same platform:
```
NTT:
median: 281 cycles/ticks
average: 281 cycles/ticks
INVNTT:
median: 358 cycles/ticks
average: 358 cycles/ticks
kyber_keypair:
median: 11160 cycles/ticks
average: 11198 cycles/ticks
kyber_encaps:
median: 12835 cycles/ticks
average: 12877 cycles/ticks
kyber_decaps:
median: 14646 cycles/ticks
average: 14696 cycles/ticks
```
@cryptojedi
So my implementation is mostly finished. Okay, let's see the benchmarks for NEON vs. the C reference. Benchmark on ARMv8, Raspberry Pi 4 8GB, 1.9 GHz (overclocked): https://github.com/cothan/kyber/tree/round3 Next week I will create a PR.
cothan <[email protected]> wrote:
> So I'm kinda finished with my implementation.
> There are 2 functions left, `poly_tomsg` and `poly_frommsg`. They won't affect the total execution time much, but they still need to be implemented.

Is there a reason for not using the implementations from the "ref" implementation?
Hi Peter,

Let me give some perspective on this. The AVX2-optimized implementations of the (de)compression functions, including poly_{to,from}msg, are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied.

In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext, due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed up the matrix expansion and noise sampling).

Cheers,
Gregor Seiler <[email protected]> wrote:
> Hi Peter,

Hi Gregor,

> Let me give some perspective on this. The avx2 optimized implementations of the (de)compression functions including poly_{to,from}msg are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed up the matrix expansion and noise sampling).

Oh, I didn't mean to say that we don't want an optimized implementation eventually; I was just wondering if there is any reason that the existing C implementation wouldn't work.

Cheers,
Peter
Hello @tomleavy, @cryptojedi and @gregorseiler,

We have made a fork of the Kyber repo with an ARM32 and ARM64 implementation of Kyber. We have also made a Java wrapper for Kyber, and we are currently working on Python bindings for Kyber as well. We hope this helps and contributes to the Kyber project as well as its implementation on multiple platforms.

Cheers,
The Beechat Network team
After iterating on the code, my NEON ARMv8 implementation is finally complete. I also have some benchmarks on Apple M1; the speed-up there is a bit better.
EDIT: the unit is cycles. I think this NEON version is ready. What else should I do to make a pull request?
@cothan Could you also report the cycle counts for your latest code running on your Raspberry Pis?
Yes, sure. I obtained the results using a patch of PAPI for the Cortex-A72: https://github.com/cothan/PAPI_ARMv8_Cortex_A72 Note:
@cothan Which OS and compiler did you use on the Raspberry Pi?
Hi @marco-palumbi, here is the configuration on my RPi 4:
I use Clang across all my results.
Thanks a lot @cothan for sharing your work on developing optimized PQC implementations for Arm; it's great to see more interest and progress in this area. Note that you can further improve performance by using the fixed-point doubling multiply-high instruction (`vqdmulh`). Finally, you can use the fixed-point multiply-high-accumulate instruction. If you publish or present your code with those modifications, a mention would be appreciated. A cite-able paper on the use of these instructions:
Hi @hanno-arm, yes, you're right, that is the multiply instruction I was looking for. I'm aware that a specialty of AVX2 is that it can multiply high or low in a single instruction. I looked for a similar instruction in the past, but due to time limits I didn't learn much about the rounding and saturating variants. Thank you very much. I will definitely mention you and cite your work.
Great! Here's a complete patch, by the way:

```
#define fqmul(out, in, zeta, t)                                                  \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);              /* (2*a)_H */     \
    t.val[1] = (int16x8_t)vmulq_s16(in, zeta);                 /* a_L */         \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);                 /* a_L * QINV */  \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq); /* (2*a_L*Q)_H */ \
    out = vhsubq_s16(t.val[0], t.val[3]);       /* ((2*a)_H - (2*a_L*Q)_H)/2 */
```

This, of course, doesn't yet implement constant merging. It would be interesting to see what difference it actually makes on various CPUs. Since the bottleneck may be the multiplication sequence rather than the boilerplate around it, the gain might not be as large as one would expect from the mere instruction count.
Yes, will do!
Note that MVE and SVE also have separate high/low multiply instructions. It is also noteworthy that, in contrast to AVX2, those instructions aren't limited to 16-bit lanes, which is of interest e.g. for Dilithium or an NTT-based implementation of Saber. When looking through your code, I noticed that you often duplicate twiddle factors across vectors. Have you considered loading multiple twiddles into a single vector and using lane-indexed instructions? This should work for layers 0-4 and save you some instructions (and free up some vectors).
@cothan Hello, is this NEON version sensitive to big-endian vs. little-endian mode?
Hi @Ruofei-Li, I have only tested my code on the A72 and Apple M1, so I use the default endianness on those platforms (whichever that is).