Performance improvements #58

str4d · 2020-01-17T14:08:10Z

The performance improvement comes from the upgraded chacha20 dependency. Performance is slightly improved by default, and compiling with RUSTFLAGS="-Ctarget-feature=+avx2" will boost performance significantly.

Part of #57.

codecov · 2020-01-20T05:33:34Z

Codecov Report

Merging #58 into master will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   49.23%   49.35%   +0.11%     
==========================================
  Files          22       22              
  Lines        1694     1694              
==========================================
+ Hits          834      836       +2     
+ Misses        860      858       -2

Impacted Files	Coverage Δ
src/primitives/armor.rs	`78.57% <0%> (+0.79%)`	⬆️
src/openssh.rs	`52.23% <0%> (+1.49%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84e98a2...97ca7c2. Read the comment docs.

str4d · 2020-01-20T06:03:54Z

It would be nice if we could build binaries that automatically used the faster AVX2 implementation when available, but it looks like that would require a non-trivial probably-breaking refactor of the sha2 crate to detect support at runtime instead of compile-time. cc @tarcieri

For now, I'll update the README and figure out how I'd need to publish split binaries.

tarcieri · 2020-01-20T13:35:26Z

Runtime feature detection has come up a few times. It's something we'd like to add, but we'd also like to avoid calling CPUID in the middle of hot loops. We haven't found a satisfactory solution yet.

Related thread:

RustCrypto/block-ciphers#25

tarcieri · 2020-01-20T14:45:16Z

I also look at Poly1305 AVX2 this weekend. There's another Goll/Gueron paper about it, which is unfortunately behind an IEEE paywall:

https://ieeexplore.ieee.org/document/7113463

Probably the best bet is to adapt @floodyberry's poly1305-opt, which uses Intel SSE/AVX2 intrinsics:

Edit: made a tracking issue specifically for SIMD Poly1305: RustCrypto/universal-hashes#46

str4d · 2020-01-20T14:56:43Z

https://eprint.iacr.org/2019/842 appears to describe the Goll/Gueron technique, and then improves on it.

tarcieri · 2020-01-20T15:03:43Z

@str4d I took a quick look at the code linked from the paper here:

https://github.com/Sreyosi/Improved-SIMD-Implementation-of-Poly1305

Their technique looks both complex and the implementation incomplete, i.e. there are large swaths of important looking code commented out including everything related to buffering. (The UniversalHash trait doesn't necessarily need buffering, but the portable implementation currently provides it)

tarcieri · 2020-01-20T20:21:14Z

I attempted a mechanical translation of the 2019/842 eprint code using Corrode:

https://github.com/RustCrypto/universal-hashes/pull/47/files

I noticed a lot of the original code is commented out and likely incomplete.

If someone actually wants to do the work to debug it it might make for the basis of an optimized Poly1305 implementation, but I think it probably makes more sense to start with a perhaps less optimized but known-to-be-working implementation like floodyberry/poly1305-opt.

(that said, because floodyberry/poly1305-opt has sprinkles of inline assembly it is a less straightforward translation. also the current implementation depends on the total output length in advance but it may be possible to work around that)

str4d · 2020-01-20T22:20:29Z

I've now had a chance to look over the paper. It appears (from their description) that Goll/Gueron are simply doing the obvious thing of processing four blocks of the message at a time. This works great if the number of message blocks is 0 mod 4, but the Goll/Gueron implementation needs to use a regular non-parallel implementation for any trailing 1-3 blocks.

2019/842 improves on this by observing that you can prepend 1-3 blocks that correspond to the zero element of the field, and the Poly1305 output will be unchanged while allowing the entire message to be processed 4 blocks at a time. I'm not sure how compatible this is with streaming implementations, as I'm pretty sure that zero blocks can only be prepended at the beginning (i.e. this observation only applies to known-length messages). 2019/842 then applies another optimisation when number of message blocks is 1 or 2 mod 4, which might be making their implementation more complex than it needs to be.

I suspect the best place to start is either floodyberry/poly1305-opt, or starting from the current software implementation and then applying the vector operations in the straightforward manner.

str4d · 2020-01-21T16:03:37Z

I'm having a go at porting Goll-Gueron from scratch instead of via corrode, so I can both retain the original's structure (which is implemented via preprocessor macros and thus is lost during corrosion), and document as much of it as I can.

str4d · 2020-01-25T21:46:27Z

poly1305 AVX2 support is taking longer than I want to leave this PR open for 😅 That can be included in a future dependency update.

Add encryption benchmark

e78c6a2

str4d mentioned this pull request Jan 17, 2020

Rage is 38% slower at encrypting than Go implementation #57

Open

cargo update

97ca7c2

str4d force-pushed the 57-performance-improvements branch from 863cb4a to 97ca7c2 Compare January 20, 2020 05:42

str4d merged commit eee96f4 into master Jan 25, 2020

str4d deleted the 57-performance-improvements branch January 25, 2020 21:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements #58

Performance improvements #58

str4d commented Jan 17, 2020 •

edited

Loading

codecov bot commented Jan 20, 2020 •

edited

Loading

str4d commented Jan 20, 2020

tarcieri commented Jan 20, 2020

tarcieri commented Jan 20, 2020 •

edited

Loading

str4d commented Jan 20, 2020

tarcieri commented Jan 20, 2020 •

edited

Loading

tarcieri commented Jan 20, 2020 •

edited

Loading

str4d commented Jan 20, 2020

str4d commented Jan 21, 2020

str4d commented Jan 25, 2020

Performance improvements #58

Performance improvements #58

Conversation

str4d commented Jan 17, 2020 • edited Loading

codecov bot commented Jan 20, 2020 • edited Loading

Codecov Report

str4d commented Jan 20, 2020

tarcieri commented Jan 20, 2020

tarcieri commented Jan 20, 2020 • edited Loading

str4d commented Jan 20, 2020

tarcieri commented Jan 20, 2020 • edited Loading

tarcieri commented Jan 20, 2020 • edited Loading

str4d commented Jan 20, 2020

str4d commented Jan 21, 2020

str4d commented Jan 25, 2020

str4d commented Jan 17, 2020 •

edited

Loading

codecov bot commented Jan 20, 2020 •

edited

Loading

tarcieri commented Jan 20, 2020 •

edited

Loading

tarcieri commented Jan 20, 2020 •

edited

Loading

tarcieri commented Jan 20, 2020 •

edited

Loading