use c2-chacha for stable, runtime-dispatched SIMD #789

kazcw · 2019-04-30T18:29:42Z

Performance:
gen_bytes_chacha20: 254 MB/s -> 603 MB/s [on a Xeon L5630 (SSE4.1)]

Minor version bump:
the only breaking change is that no-std builds now require
default-features=false (std is required by default for runtime cpu
detection; no_std builds will use the best implementation supported by
the target-features/target-cpu enabled at compile time)
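
The difference between runtime detection (needs std) and compile-time target features (all a no_std build can rely on) can be sketched roughly as below. This is only an illustration, not the c2-chacha or ppv-lite86 code: the `Backend` enum and both function names are hypothetical, and only x86_64 is considered.

```rust
/// Hypothetical set of backends, ordered from least to most capable.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Backend {
    Portable,
    Sse2,
    Sse41,
    Avx2,
}

/// Best backend guaranteed by the target features enabled at compile time
/// (e.g. via `-C target-cpu=native`). A no_std build has only this.
fn compile_time_backend() -> Backend {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        Backend::Avx2
    } else if cfg!(all(target_arch = "x86_64", target_feature = "sse4.1")) {
        Backend::Sse41
    } else if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) {
        Backend::Sse2
    } else {
        Backend::Portable
    }
}

/// Best backend supported by the CPU the program is actually running on.
/// The detection macro lives in std, hence the std requirement.
fn runtime_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if is_x86_feature_detected!("sse4.1") {
            return Backend::Sse41;
        }
        if is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    // Other architectures (or very old CPUs) fall back to portable code.
    Backend::Portable
}

fn main() {
    println!("compile-time backend: {:?}", compile_time_backend());
    println!("runtime backend:      {:?}", runtime_backend());
}
```

On a default x86_64 build the compile-time answer is typically SSE2, while the runtime answer depends on the machine, which is why runtime dispatch matters for prebuilt binaries.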

New functionality:
ChaChaXRng is parameterized by round count at compile time. Convenient
aliases for the typical 20/12/8 round implementations exposed. ChaChaRng
is aliased to ChaCha20Rng for backward compatibility.

Closes #667
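
To illustrate what "parameterized by round count at compile time" means, here is a minimal portable sketch. It is not the c2-chacha implementation: it uses const generics instead of typenum, checks evenness with a runtime assertion rather than at the type level, and the `ChaCha` struct, its aliases and `EXAMPLE_STATE` are names made up for this example. The input state is the block function example from RFC 8439, section 2.3.2.

```rust
/// The ChaCha quarter round applied to four words of the state.
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(7);
}

/// Hypothetical block function whose round count is a compile-time parameter.
struct ChaCha<const ROUNDS: usize> {
    state: [u32; 16],
}

// Aliases for the usual round counts (illustrative names).
type ChaCha20 = ChaCha<20>;
type ChaCha12 = ChaCha<12>;
type ChaCha8 = ChaCha<8>;

impl<const ROUNDS: usize> ChaCha<ROUNDS> {
    fn new(state: [u32; 16]) -> Self {
        // Simplification: a runtime check instead of a type-level one.
        assert!(ROUNDS % 2 == 0, "ChaCha operates in double rounds");
        Self { state }
    }

    /// Run ROUNDS rounds over a copy of the state and add the input back.
    fn block(&self) -> [u32; 16] {
        let mut x = self.state;
        for _ in 0..ROUNDS / 2 {
            // Column round.
            quarter_round(&mut x, 0, 4, 8, 12);
            quarter_round(&mut x, 1, 5, 9, 13);
            quarter_round(&mut x, 2, 6, 10, 14);
            quarter_round(&mut x, 3, 7, 11, 15);
            // Diagonal round.
            quarter_round(&mut x, 0, 5, 10, 15);
            quarter_round(&mut x, 1, 6, 11, 12);
            quarter_round(&mut x, 2, 7, 8, 13);
            quarter_round(&mut x, 3, 4, 9, 14);
        }
        for (out, inp) in x.iter_mut().zip(self.state.iter()) {
            *out = out.wrapping_add(*inp);
        }
        x
    }
}

/// Input state from the RFC 8439 section 2.3.2 example
/// (constants, key 00..1f, block counter 1, nonce).
const EXAMPLE_STATE: [u32; 16] = [
    0x61707865, 0x3320646e, 0x79622d32, 0x6b206574,
    0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
    0x13121110, 0x17161514, 0x1b1a1918, 0x1f1e1d1c,
    0x00000001, 0x09000000, 0x4a000000, 0x00000000,
];

fn main() {
    let b8 = ChaCha8::new(EXAMPLE_STATE).block();
    let b12 = ChaCha12::new(EXAMPLE_STATE).block();
    let b20 = ChaCha20::new(EXAMPLE_STATE).block();
    println!("first word: 8r={:08x} 12r={:08x} 20r={:08x}", b8[0], b12[0], b20[0]);
}
```

Because the round count is part of the type, each variant gets its own monomorphized loop, and a generator built on `ChaCha8` cannot be confused with one built on `ChaCha20`.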



dhardy · 2019-04-30T20:12:46Z

Xeon 1231 aka Haswell:

# from
test gen_bytes_chacha20             ... bench:   2,140,914 ns/iter (+/- 85,416) = 478 MB/s
test gen_u32_chacha20               ... bench:       7,330 ns/iter (+/- 128) = 545 MB/s
test gen_u64_chacha20               ... bench:      14,959 ns/iter (+/- 1,032) = 534 MB/s
test init_chacha                    ... bench:          23 ns/iter (+/- 1)
# to
test gen_bytes_chacha8              ... bench:     442,658 ns/iter (+/- 14,419) = 2313 MB/s
test gen_bytes_chacha12             ... bench:     592,790 ns/iter (+/- 13,809) = 1727 MB/s
test gen_bytes_chacha20             ... bench:     891,697 ns/iter (+/- 43,009) = 1148 MB/s
test gen_u32_chacha8                ... bench:       2,153 ns/iter (+/- 67) = 1857 MB/s
test gen_u32_chacha12               ... bench:       2,740 ns/iter (+/- 189) = 1459 MB/s
test gen_u32_chacha20               ... bench:       3,917 ns/iter (+/- 196) = 1021 MB/s
test gen_u64_chacha8                ... bench:       4,676 ns/iter (+/- 165) = 1710 MB/s
test gen_u64_chacha12               ... bench:       5,893 ns/iter (+/- 288) = 1357 MB/s
test gen_u64_chacha20               ... bench:       8,225 ns/iter (+/- 240) = 972 MB/s
test init_chacha                    ... bench:          45 ns/iter (+/- 2)
# compare
test gen_bytes_hc128                ... bench:     455,679 ns/iter (+/- 28,226) = 2247 MB/s
test gen_u32_hc128                  ... bench:       1,916 ns/iter (+/- 49) = 2087 MB/s
test gen_u64_hc128                  ... bench:       3,788 ns/iter (+/- 391) = 2111 MB/s
test init_hc128                     ... bench:       4,659 ns/iter (+/- 177)

Certainly a nice improvement!
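
For reference, the MB/s column in the output above can be reproduced with a one-line formula. The sketch below assumes 1,024,000 bytes per iteration for the gen_bytes benchmarks, 4,000 for gen_u32 and 8,000 for gen_u64; these sizes are inferred from the reported figures rather than stated in this thread, and the function names are illustrative.

```rust
/// Throughput in MB/s (10^6 bytes per second), truncated to an integer:
/// bytes per iteration divided by nanoseconds per iteration, times 1000.
fn mb_per_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    bytes_per_iter * 1000 / ns_per_iter
}

/// Ratio of two per-iteration timings.
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}

fn main() {
    // Assumed benchmark size: 1_024_000 bytes per iteration (inferred).
    const GEN_BYTES: u64 = 1_024_000;

    let before = mb_per_s(GEN_BYTES, 2_140_914); // gen_bytes_chacha20, "from"
    let after = mb_per_s(GEN_BYTES, 891_697); // gen_bytes_chacha20, "to"
    let hc128 = mb_per_s(GEN_BYTES, 455_679); // gen_bytes_hc128

    println!("chacha20: {} MB/s -> {} MB/s", before, after);
    println!("hc128:    {} MB/s", hc128);
    println!("speedup:  {:.2}x", speedup(2_140_914, 891_697));
}
```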

Any thoughts @kazcw on which should be our default for StdRng? From what I remember ChaCha8 is potentially marginal but without known exploits, while others have significant margins.

BTW I pushed additional benchmark code to my fork.

dhardy · 2019-05-02T07:36:24Z

@burdges would you be able to review this?

vks · 2019-05-02T11:51:44Z

Those are some very nice performance improvements! HC128 being faster than ChaCha20 and ChaCha12 while being slower than ChaCha8 is consistent with the eBACS benchmarks. However, it seems like the speed difference between ChaCha20 and HC128 is less than in our case, so I wonder whether there is still some potential for optimization.

rand_chacha/src/lib.rs

dhardy

Looks good.

One concern is the number of crates this depends on (c2-chacha, byteorder, generic-array, typenum, ppv-lite86). We've had a few complaints that Rand pulls in too many dependencies; if we now use this for StdRng then we won't be reducing the number of dependencies as previously promised. This is not a blocking issue but we should consider whether there are good alternatives.

rand_chacha/src/chacha.rs

dhardy · 2019-05-03T09:15:03Z

> HC128 being faster than ChaCha20

Not really; ignoring initialisation time ChaCha20 is more than twice as fast as HC128 on several of those benchmarks.

vks · 2019-05-03T10:35:45Z

> Not really; ignoring initialisation time ChaCha20 is more than twice as fast as HC128 on several of those benchmarks.

Did you mean including initialization time?

burdges · 2019-05-03T15:14:18Z

IETF ChaCha has an initialization phase, but not ChaCha20. I'm not sure if I'm the ideal person to review this really.

dhardy · 2019-05-03T15:32:07Z

Who is the ideal person to review this then? You don't have to worry so much about correctness since we have test vectors; it's more about the API, approach and style.

kazcw · 2019-05-03T16:41:54Z

@dhardy Although the security margin of ChaCha12 is thought to be solid, AFAIK no one is currently using ChaCha with less than 20 rounds for anything important. My inclination would be to default to ChaCha20, to avoid being on the forefront of trusting anything out-of-the-box.

@vks There's room for more performance improvement: the AVX2 backend of ppv-lite86 is currently disabled, so right now only machine features up through SSE4.1 are in use. Once I get that code updated to the new ppv-lite API, performance should be within spitting distance of hand-optimized assembly for this algo.

@dhardy I can cut down the dependencies. I'll see what I can do.


kazcw · 2019-05-03T18:13:35Z

@dhardy As of 90adb48 I eliminated all (transitive) dependencies except c2-chacha (common impl for rand and stream-cipher) and ppv-lite86 (SIMD library).


dhardy · 2019-05-03T20:47:33Z

Thanks @kazcw.

Rand is in the interesting position of trying to provide a "general purpose generator", and because of that we don't necessarily want to make the same compromises as others. I wish we could allow configuration by the end user, but the only option we have for that is feature flags, which are not ideal for this kind of configuration. (I suppose we could have several flags like stdrng_chacha20 and use the most secure of all enabled feature flags, or the most secure option if no flag is used.)
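
The selection rule suggested in the parenthesis could look roughly like the sketch below. It is hypothetical: only the flag name stdrng_chacha20 appears above, so stdrng_chacha12 and stdrng_chacha8 are invented by analogy, "most secure" is taken to mean "most rounds", and `select_rounds` and `STD_RNG_ROUNDS` are illustrative names.

```rust
/// Choose the round count from the set of enabled flags: the most secure
/// enabled option wins, and with no flag at all the most secure option is used.
const fn select_rounds(chacha20: bool, chacha12: bool, chacha8: bool) -> u32 {
    if chacha20 {
        20
    } else if chacha12 {
        12
    } else if chacha8 {
        8
    } else {
        20 // no flag enabled: fall back to the most conservative choice
    }
}

/// Round count picked for this build. `cfg!` evaluates to `false` for any
/// feature that is not enabled, so a build without flags gets 20 rounds.
/// Flag names other than stdrng_chacha20 are assumptions.
const STD_RNG_ROUNDS: u32 = select_rounds(
    cfg!(feature = "stdrng_chacha20"),
    cfg!(feature = "stdrng_chacha12"),
    cfg!(feature = "stdrng_chacha8"),
);

fn main() {
    println!("StdRng would use ChaCha with {} rounds", STD_RNG_ROUNDS);
}
```

Since Cargo features are additive across the dependency graph, letting the strongest enabled flag win means one crate asking for fewer rounds cannot weaken the generator for another crate that asked for more.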

burdges · 2019-05-04T12:01:21Z

Interface looks fine. It's basically the same, no? It's fine using distinct types for round count.

I avoid dependencies on RustCrypto whenever possible myself, but that may not be sensible here. Ideally Rust would have implicit feature flags for other crates in the same build, so RustCrypto would only become a dependency if it was a dependency anyway.

I have not dug into the c2_chacha code.

kazcw · 2019-05-04T19:37:47Z

avx2 has landed in the SIMD backend, so that should make your fancy-schmancy Haswells happy 😄. Benchmark comparison from a GCE instance:

# current master:
test gen_bytes_chacha20 ... bench: 2,591,248 ns/iter (+/- 101,109) = 395 MB/s

# c2-chacha branch, with ppv-lite86 0.2.2:
test gen_bytes_chacha20 ... bench: 1,273,655 ns/iter (+/- 26,588) = 803 MB/s

# c2-chacha branch, with ppv-lite86 0.2.3:
test gen_bytes_chacha20 ... bench: 752,480 ns/iter (+/- 25,401) = 1360 MB/s

dhardy · 2019-05-05T08:39:02Z

Certainly looks good in the micro-benches 😄

test gen_bytes_chacha12             ... bench:     321,606 ns/iter (+/- 4,552) = 3184 MB/s
test gen_bytes_chacha20             ... bench:     491,814 ns/iter (+/- 17,275) = 2082 MB/s
test gen_bytes_chacha8              ... bench:     237,518 ns/iter (+/- 6,924) = 4311 MB/s

We ought to test non-Intel CPUs too; I guess ChaCha20 is a lot slower on ARMv7?

Anyway, I think this PR is ready now? Thanks for all the contributions @kazcw!

dhardy · 2019-05-05T09:03:03Z

Dependencies are now down to autocfg (used by several Rand crates), c2-chacha, lazy_static, ppv-lite86 and rand_core, which I think is acceptable.

vks · 2019-05-06T08:47:01Z

rand_chacha/Cargo.toml


 [build-dependencies]
 autocfg = "0.1"

 [features]
-default = ["std"]
+default = ["std", "simd"]


This is not a breaking change, because the simd feature does not change the API, right?

kazcw · 2019-05-06

Right, disabling simd just forces use of the same portable implementation as architectures that don't have SIMD backends.

kazcw added 5 commits April 30, 2019 11:27

impl SeedableRng for ChaCha*Core

6d7acb3

parameterize ChaChaXRng by rounds, not double-rounds

6df5578

More familiar numbers; evenness requirement enforced by PartialDiv.

update doc to use ChaCha with explicit round count

bc95e19

docs: ref base chacha types (aliases can't be linked)

1999aaf

Add benchmarks for new chacha variants

29da6f9

vks reviewed May 2, 2019

rand_chacha/src/lib.rs (outdated, resolved)

dhardy reviewed May 3, 2019

rand_chacha/src/chacha.rs (outdated, resolved)

kazcw added 4 commits May 3, 2019 09:45

restore copyright notice

a9a4930

remove interface for custom round counts

0aaf63b

impl without direct use of generic_array/typenum

933d383

use updated c2-chacha

90adb48

Eliminated all transitive deps for rand_chacha except c2-chacha (common impl for rand and stream-cipher) and ppv-lite86 (SIMD implementation).

kazcw added 3 commits May 3, 2019 11:20

update doc links now that ChaCha impls aren't aliases

0cf7f45

rand_chacha under miri: no std

35d4f35

miri can't handle libstd's CPU detection. I presume it can't do the SIMD intrinsics either but I'll address that in ppv-lite.

enable build without SIMD

965b21e

I don't know if any user would want this but it's needed for miri

vks approved these changes May 6, 2019


dhardy merged commit 65b8198 into rust-random:master May 7, 2019

dhardy mentioned this pull request May 7, 2019

Use ChaCha20 in StdRng and feature-gate SmallRng #792

Merged

This was referenced May 15, 2019

Tracker: Rand 0.7 #715

Closed

Fix Miri testing of ChaCha #796

Closed

dhardy mentioned this pull request Aug 28, 2019

rand_chacha and reducing dependencies #872

Closed
