Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #58

Merged
merged 2 commits into from
Jan 25, 2020
Merged

Performance improvements #58

merged 2 commits into from
Jan 25, 2020

Conversation

str4d
Copy link
Owner

@str4d str4d commented Jan 17, 2020

The performance improvement comes from the upgraded chacha20 dependency. Performance is slightly improved by default, and compiling with RUSTFLAGS="-Ctarget-feature=+avx2" will boost performance significantly.

Part of #57.

@codecov
Copy link

codecov bot commented Jan 20, 2020

Codecov Report

Merging #58 into master will increase coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   49.23%   49.35%   +0.11%     
==========================================
  Files          22       22              
  Lines        1694     1694              
==========================================
+ Hits          834      836       +2     
+ Misses        860      858       -2
Impacted Files Coverage Δ
src/primitives/armor.rs 78.57% <0%> (+0.79%) ⬆️
src/openssh.rs 52.23% <0%> (+1.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84e98a2...97ca7c2. Read the comment docs.

@str4d str4d force-pushed the 57-performance-improvements branch from 863cb4a to 97ca7c2 Compare January 20, 2020 05:42
@str4d
Copy link
Owner Author

str4d commented Jan 20, 2020

It would be nice if we could build binaries that automatically used the faster AVX2 implementation when available, but it looks like that would require a non-trivial probably-breaking refactor of the sha2 crate to detect support at runtime instead of compile-time. cc @tarcieri

For now, I'll update the README and figure out how I'd need to publish split binaries.

@tarcieri
Copy link

Runtime feature detection has come up a few times. It's something we'd like to add, but we'd also like to avoid calling CPUID in the middle of hot loops. We haven't found a satisfactory solution yet.

Related thread:

RustCrypto/block-ciphers#25

@tarcieri
Copy link

tarcieri commented Jan 20, 2020

I also look at Poly1305 AVX2 this weekend. There's another Goll/Gueron paper about it, which is unfortunately behind an IEEE paywall:

https://ieeexplore.ieee.org/document/7113463

Probably the best bet is to adapt @floodyberry's poly1305-opt, which uses Intel SSE/AVX2 intrinsics:

Edit: made a tracking issue specifically for SIMD Poly1305: RustCrypto/universal-hashes#46

@str4d
Copy link
Owner Author

str4d commented Jan 20, 2020

https://eprint.iacr.org/2019/842 appears to describe the Goll/Gueron technique, and then improves on it.

@tarcieri
Copy link

tarcieri commented Jan 20, 2020

@str4d I took a quick look at the code linked from the paper here:

https://github.com/Sreyosi/Improved-SIMD-Implementation-of-Poly1305

Their technique looks both complex and the implementation incomplete, i.e. there are large swaths of important looking code commented out including everything related to buffering. (The UniversalHash trait doesn't necessarily need buffering, but the portable implementation currently provides it)

@tarcieri
Copy link

tarcieri commented Jan 20, 2020

I attempted a mechanical translation of the 2019/842 eprint code using Corrode:

https://github.com/RustCrypto/universal-hashes/pull/47/files

I noticed a lot of the original code is commented out and likely incomplete.

If someone actually wants to do the work to debug it it might make for the basis of an optimized Poly1305 implementation, but I think it probably makes more sense to start with a perhaps less optimized but known-to-be-working implementation like floodyberry/poly1305-opt.

(that said, because floodyberry/poly1305-opt has sprinkles of inline assembly it is a less straightforward translation. also the current implementation depends on the total output length in advance but it may be possible to work around that)

@str4d
Copy link
Owner Author

str4d commented Jan 20, 2020

I've now had a chance to look over the paper. It appears (from their description) that Goll/Gueron are simply doing the obvious thing of processing four blocks of the message at a time. This works great if the number of message blocks is 0 mod 4, but the Goll/Gueron implementation needs to use a regular non-parallel implementation for any trailing 1-3 blocks.

2019/842 improves on this by observing that you can prepend 1-3 blocks that correspond to the zero element of the field, and the Poly1305 output will be unchanged while allowing the entire message to be processed 4 blocks at a time. I'm not sure how compatible this is with streaming implementations, as I'm pretty sure that zero blocks can only be prepended at the beginning (i.e. this observation only applies to known-length messages). 2019/842 then applies another optimisation when number of message blocks is 1 or 2 mod 4, which might be making their implementation more complex than it needs to be.

I suspect the best place to start is either floodyberry/poly1305-opt, or starting from the current software implementation and then applying the vector operations in the straightforward manner.

@str4d
Copy link
Owner Author

str4d commented Jan 21, 2020

I'm having a go at porting Goll-Gueron from scratch instead of via corrode, so I can both retain the original's structure (which is implemented via preprocessor macros and thus is lost during corrosion), and document as much of it as I can.

@str4d
Copy link
Owner Author

str4d commented Jan 25, 2020

poly1305 AVX2 support is taking longer than I want to leave this PR open for 😅 That can be included in a future dependency update.

@str4d str4d merged commit eee96f4 into master Jan 25, 2020
@str4d str4d deleted the 57-performance-improvements branch January 25, 2020 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants