Performance improvements #58
Codecov Report
@@ Coverage Diff @@
## master #58 +/- ##
==========================================
+ Coverage 49.23% 49.35% +0.11%
==========================================
Files 22 22
Lines 1694 1694
==========================================
+ Hits 834 836 +2
+ Misses 860 858 -2
863cb4a to 97ca7c2
It would be nice if we could build binaries that automatically used the faster AVX2 implementation when available, but it looks like that would require a non-trivial probably-breaking refactor of the For now, I'll update the README and figure out how I'd need to publish split binaries.
Runtime feature detection has come up a few times. It's something we'd like to add, but we'd also like to avoid calling CPUID in the middle of hot loops. We haven't found a satisfactory solution yet. Related thread:
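One way to keep feature detection out of the hot loop is to query the CPU once, when the cipher object is constructed, and store the result. The sketch below only illustrates that idea. `Backend`, `ToyCipher`, `detect_backend`, and `xor_portable` are hypothetical names, not part of the `chacha20` crate, and the "AVX2" arm reuses the portable loop as a stand-in for a real kernel. The only real API used is the standard library's `is_x86_feature_detected!` macro.

```rust
/// Hypothetical backend selector (an assumption, not the `chacha20` crate API).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Avx2,
    Portable,
}

/// Query CPU features once. On non-x86 targets the check is compiled out.
pub fn detect_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    Backend::Portable
}

/// Toy kernel standing in for a real keystream XOR.
fn xor_portable(data: &mut [u8], key: u8) {
    for b in data.iter_mut() {
        *b ^= key;
    }
}

/// Hypothetical cipher that remembers which backend to use.
pub struct ToyCipher {
    backend: Backend,
    key: u8,
}

impl ToyCipher {
    /// Detection happens here, once per instance, not per block.
    pub fn new(key: u8) -> Self {
        ToyCipher { backend: detect_backend(), key }
    }

    pub fn backend(&self) -> Backend {
        self.backend
    }

    /// One branch per buffer; the inner loop never re-checks CPU features.
    pub fn apply(&self, data: &mut [u8]) {
        match self.backend {
            // A real implementation would call a
            // `#[target_feature(enable = "avx2")]` function here.
            Backend::Avx2 => xor_portable(data, self.key),
            Backend::Portable => xor_portable(data, self.key),
        }
    }
}
```

The trade-off in this sketch is one extra field and one branch per call, in exchange for never touching feature detection while processing blocks.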
I also looked at Poly1305 AVX2 this weekend. There's another Goll/Gueron paper about it, which is unfortunately behind an IEEE paywall: https://ieeexplore.ieee.org/document/7113463 Probably the best bet is to adapt @floodyberry's poly1305-opt, which uses Intel SSE/AVX2 intrinsics:
Edit: made a tracking issue specifically for SIMD Poly1305: RustCrypto/universal-hashes#46
https://eprint.iacr.org/2019/842 appears to describe the Goll/Gueron technique, and then improves on it.
@str4d I took a quick look at the code linked from the paper here: https://github.com/Sreyosi/Improved-SIMD-Implementation-of-Poly1305 Their technique looks complex and the implementation looks incomplete, i.e. there are large swaths of important-looking code commented out, including everything related to buffering. (The
I attempted a mechanical translation of the 2019/842 eprint code using Corrode: https://github.com/RustCrypto/universal-hashes/pull/47/files I noticed a lot of the original code is commented out and likely incomplete. If someone actually wants to do the work to debug it, it might serve as the basis of an optimized Poly1305 implementation, but I think it probably makes more sense to start with a perhaps less optimized but known-to-be-working implementation like (that said, because
I've now had a chance to look over the paper. It appears (from their description) that Goll/Gueron are simply doing the obvious thing of processing four blocks of the message at a time. This works great if the number of message blocks is 0 mod 4, but the Goll/Gueron implementation needs to use a regular non-parallel implementation for any trailing 1-3 blocks.
I suspect the best place to start is either
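The "four blocks at a time, then a serial tail" structure described above can be sketched with a toy polynomial hash. This is not Poly1305 and not the Goll/Gueron code: the modulus `P`, the `u64` block type, and the function names are assumptions chosen to keep the example small (real Poly1305 works modulo 2^130 - 5 on 16-byte blocks). It only shows that one grouped step using r^4, r^3, r^2, and r matches four serial Horner steps, and that 1-3 leftover blocks fall back to the serial step.

```rust
/// Toy modulus (assumption); real Poly1305 uses 2^130 - 5.
const P: u128 = (1 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P) as u64
}

fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P) as u64
}

/// Serial Horner evaluation: h = (h + m) * r for each block.
pub fn eval_serial(blocks: &[u64], r: u64) -> u64 {
    blocks.iter().fold(0, |h, &m| mul(add(h, m), r))
}

/// Four blocks per step, with a serial fallback for the trailing 1-3 blocks.
pub fn eval_four_way(blocks: &[u64], r: u64) -> u64 {
    let r2 = mul(r, r);
    let r3 = mul(r2, r);
    let r4 = mul(r3, r);

    let mut groups = blocks.chunks_exact(4);
    let mut h = 0u64;
    for g in &mut groups {
        // The four products are independent of each other, which is what
        // a SIMD implementation would compute in parallel lanes.
        let t0 = mul(add(h, g[0]), r4);
        let t1 = mul(g[1], r3);
        let t2 = mul(g[2], r2);
        let t3 = mul(g[3], r);
        h = add(add(t0, t1), add(t2, t3));
    }
    // Trailing blocks (when the count is not a multiple of four).
    for &m in groups.remainder() {
        h = mul(add(h, m), r);
    }
    h
}
```

In this sketch the tail loop is the part that corresponds to the "regular non-parallel implementation" mentioned in the comment.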
I'm having a go at porting Goll-Gueron from scratch instead of via corrode, so I can both retain the original's structure (which is implemented via preprocessor macros and thus is lost during corrosion), and document as much of it as I can.
The performance improvement comes from the upgraded chacha20 dependency. Performance is slightly improved by default, and compiling with RUSTFLAGS="-Ctarget-feature=+avx2" will boost performance significantly.

Part of #57.
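As a rough illustration of what the `RUSTFLAGS` setting changes: `-Ctarget-feature=+avx2` enables the `avx2` target feature for the whole build, and code can branch on it at compile time with `cfg!`. The function below is a hypothetical helper, not something from this crate or from `chacha20`. It simply reports which path a binary was built for.

```rust
/// Hypothetical helper: reports the backend selected at compile time.
/// `cfg!(target_feature = "avx2")` is true only when the build enabled
/// AVX2, for example via RUSTFLAGS="-Ctarget-feature=+avx2".
pub fn compiled_backend() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else {
        "portable"
    }
}
```

Because the choice is fixed at build time, a binary built with the flag should only be run on CPUs that support AVX2.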