Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review which hash algorithm to use for incremental compilation hashes. #41215

Closed
michaelwoerister opened this issue Apr 11, 2017 · 42 comments
Closed
Labels
A-incr-comp Area: Incremental compilation C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC

Comments

@michaelwoerister
Copy link
Member

michaelwoerister commented Apr 11, 2017

For incremental compilation we are hashing lots of things, HIR, crate metadata and soon also type checking tables. These hashes are used purely as fingerprints, that is, we compare these hashes in order to find out if the things having been hashed are equal or not. Consequently the primary qualities we require from the hash algorithm are:

  1. That it acts as a pseudo-random-function, i.e. whatever we feed in, the resulting hashes should have a collision probability that is roughly the same as that of random numbers.
  2. That it produces wide enough hash values so the collision probabilities are low enough (128 bits should be fine).
  3. That it is fast.

It remains to be clarified whether it makes sense to use a cryptographic hash function, i.e. that it is hard to find/construct collisions. My guess is no:

  • Incremental compilation caches are not meant to be distributed. They are platform dependent files created locally by your compiler and there is little use in distributing them.
  • A 128 bit hash is probably susceptible to a brute force attack anyway.
  • I cannot think of an attack that exploits a hash collision here. If one wants to tamper with an incremental compilation cache, one could just replace the object file in the cache without a need to ever touch item hashes.

I'd be very interested though if someone could come up with an attack scenario. But even then the solution would probably be to implement some general code signing scheme for these caches.

At the moment we are using BLAKE2 as our hashing algorithm which is a cryptographic hash function and thus fulfills requirements (1) and (2) very well. However, compared to non-cryptographic hash functions it is rather slow. For raw throughput SipHash would be about three times as fast and fast hash functions like Metrohash can be 20 times as fast. Before you get too excited though: Most of the time computing hashes is spent gathering the data that is fed into the hash function, not actually running the function on it. But there are still some gains to be had, here are some preliminary tests of using BLAKE2 -- which is our "worst case" -- and MetroHash -- which is among the fastest hash functions providing 128 bit hashes:

BLAKE2 MetroHash SipHash 1-3 SipHash 2-4
libcore 0.258s 0.193s (-25%) 0.189s (-26%) 0.201s (-22%)
librustc 0.435s 0.345s (-20%) 0.320s (-26%) 0.327s (-25%)
syntex_syntax (HIR) 0.221s 0.166s (-25%) 0.168s (-24%) 0.176s (-20%)
syntex_syntax (Metadata) 0.456s 0.326s (-28%) 0.312s (-31%) 0.347s (-24%)

So it seems that a fast hash functions allows to save 20-30% spent during ICH computation. If we could get some confidence that:

  • a function like MetroHash provides low enough collision probability and
  • we really don't need a cryptographic hash function

then we could have some free speedups here.

UPDATE:
I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.

@michaelwoerister michaelwoerister added the A-incr-comp Area: Incremental compilation label Apr 11, 2017
@est31
Copy link
Member

est31 commented Apr 11, 2017

As another idea, if non cryptographic hashes are allowed, we could use SHA-1 on platforms where its hardware accelerated (apparently Ryzen and newer and Goldmont and newer). SHA-1 is slightly slower than BLAKE2, but I guess with hardware acceleration it will be much faster, maybe faster than a MetroHash implemented with generic instructions. Would be interesting to have benchmarks on this.

@michaelwoerister
Copy link
Member Author

That's an interesting idea. However, I'd need to see actual numbers for that and they would have to be very compelling in order to justify adding platform-specific code.
Also, once we go down the platform-specific road: MetroHash can also make use of some special instructions, almost doubling its speed.

@michaelwoerister
Copy link
Member Author

A 128 bit SipHash13 might also be a good contender. It's pretty fast and it's designed to be a good PRF.

@arthurprs
Copy link
Contributor

arthurprs commented Apr 12, 2017

Siphash is probably a sensible choice here if the numbers look good. Metrohash is great but I think we should try to be conservative to some extent.

@michaelwoerister
Copy link
Member Author

@rust-lang/libs: There are no plans of making the 128-bit version of SipHash available in the standard library, right?

@sfackler
Copy link
Member

Not in a stable manner, no, but that shouldn't matter for compiler internals.

@est31
Copy link
Member

est31 commented Apr 18, 2017

Also, afaik its possible to simply use crates from crates.io now. Just add them to the Cargo.toml and commit both that change and the change to Cargo.lock into git.

@michaelwoerister
Copy link
Member Author

I added the 128 bit version of SipHash to the table and the timings are as good as the ones for MetroHash.

@arthurprs
Copy link
Contributor

Sounds like a clear winner to me. Is that 1-3 or 2-4 variant?

@michaelwoerister
Copy link
Member Author

It's 1-3 ...

@michaelwoerister
Copy link
Member Author

I added the SipHash 2-4 timings too. As expected they are slightly worse but still pretty good.

@briansmith
Copy link
Contributor

That it produces wide enough hash values so the collision probabilities are low enough (128 bits should be fine).

NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x)||"123456789012" is a 256-bit hash but it isn't nearly as strong as SHA-256.

Instead, it would be good to mention the desired collision probability. Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disasterous.

Also, I don't think the use of non-Rust implementations should be rejected since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language then a weaker algorithm that's implemented in Rust.

@michaelwoerister
Copy link
Member Author

michaelwoerister commented Apr 28, 2017

NIT: It doesn't make sense to measure hashes by the number of bits this way, because the number of bits doesn't correlate strongly enough with the desired collision probability. For example, `SHA-1(x)||"123456789012" is a 256-bit hash but it isn't nearly as strong as SHA-256.

I'm working under the assumption that a high-quality hash function will use all bits available.

Also, what's the threat model? I.e. is there any way that something could go wrong if there's a collision? AFAICT a collision could easily be disasterous.

A collision would result in some piece data not being detected as having changed. I would not talk about "threat model" though. Cache files are not meant to be distributed. An attacker would be better off just manipulating rlibs directly.

Also, I don't think the use of non-Rust implementations should be rejected since a huge amount of the compiler isn't written in Rust already. I would rather the compiler use a safer algorithm written in assembly language then a weaker algorithm that's implemented in Rust.

Nobody said it has to be written in Rust. But maintenance cost is a factor too.

@nagisa
Copy link
Member

nagisa commented May 18, 2017

SHA-256 based on the dedicated SHA instructions can do 1.7GB/s on Ryzen. Probably more for SHA-1, but I didn’t benchmark it yet. AVX2 implementation of Fletcher4 that ZFS uses for checksumming does 15GB/s on the same machine. That same Fletcher4 is approximately 2GB/s for plain-old C version.

(EDIT: of course Fletcher4 is good for checksumming and does not really work well for avoiding collisions)

@arthurprs
Copy link
Contributor

Any reason NOT to use siphash2-4 on this?

It's gonna give rustc 90% of the speed benefit and it'll leave little to no questions regarding maintaining platform specific code, hashing quality or security (the last is a weak argument, but it exists anyway).

@michaelwoerister
Copy link
Member Author

Any reason NOT to use siphash2-4 on this?

It seems like an improvement over what we have now (which does not preclude that there are still better options).
Leaving security aside, is there any reason NOT to use siphash1-3? Does hashing quality degrade with fewer rounds?

@arthurprs
Copy link
Contributor

Apparently there's a slight loss of quality on smhasher (avalanche test worse bias from 0.72% (2-4) to 0.9%(1-3)), not sure if it's significant.

source: https://github.com/rurban/smhasher

@michaelwoerister
Copy link
Member Author

@alexcrichton What's the state of using crates.io crates in the compiler. Is that something we can do easily now? I guess that we have to at least vendor each crate we are using, right?

@Mark-Simulacrum
Copy link
Member

As far as I know, crates within the compiler itself can just be depended upon and everything should just work. I think crates that are depended upon by libstd or it's dependencies can't do that yet, but I'm not sure.

@alexcrichton
Copy link
Member

@michaelwoerister as of #41847 we should be able to just depend on crates.io crates and have things like stability, vendoring, license checking, etc, all just fall out. Unfortunately really leveraging #41847 will require a new stage0 compiler which supports that flag.

@michaelwoerister
Copy link
Member Author

Thanks for the info @Mark-Simulacrum and @alexcrichton!

Unfortunately really leveraging #41847 will require a new stage0 compiler which supports that flag.

So this will be available when the next beta comes out in a couple of weeks? That's soon enough, there's no rush with this issue.

@alexcrichton
Copy link
Member

June 8!

@joshtriplett
Copy link
Member

Has anyone tested seahash? It's several times faster than SipHash and FNV, and has a native Rust implementation. Trying it should be easy given crates.io crate support.

This doesn't need a cryptographically strong hash, and seahash should beat even a hardware-accelerated SHA-256. (Seahash itself appears designed to use hardware-accelerated vectorization where available, while remaining portable.)

@michaelwoerister
Copy link
Member Author

Looks promising. @ticki, is there also a version of seahash that provides 128 bit outputs (or more)?

@est31
Copy link
Member

est31 commented Jun 8, 2017

One way to get 128 bits of output would be to do this in the finalize step instead of the "xor them all together":

a^=c;
c^=d;
b=a;
d=c;
a^=written;
helper::diffuse(a);
helper::diffuse(c);
a^=c;
c^=b;
helper::diffuse(a);
helper::diffuse(c);
return (a,c);

But maybe @ticki has a better idea.

@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Jul 27, 2017
@arielb1
Copy link
Contributor

arielb1 commented Oct 5, 2017

If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior.

@michaelwoerister
Copy link
Member Author

@arielb1 Would you be comfortable with using SipHash? It is performing very well too and since it is supposed to combat exploits that are based on hash collisions, the bar for it's output quality is pretty high. Also, I assume that it has had lots of eyes on it since it is used so widely.

@arthurprs
Copy link
Contributor

I'm sure this was answered somewhere already, but what happens if hashes collide?

@michaelwoerister
Copy link
Member Author

I'm sure this was answered somewhere already, but what happens if hashes collide?

Then the compiler would potentially not re-run some computation although it should. The consequences of that depends on the computation/hash in question. The absolute worst case would be that an object file gets re-used although it shouldn't and you silently end up with the previous machine code version of the functions in there instead of the current one.

@est31
Copy link
Member

est31 commented Oct 5, 2017

If this is really correctness-critical, I like the cryptographic guarantees a cryptographic hash gives us and would hate to have correctness possibly be compromised by hash misbehavior.

There are two cases here where collisions could occur theoretically:

One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash, and this happens on purpose. It seems like this is very hard to pull this off. I guess many other ways of attacks are way easier (like creating a compiler plugin and then using unsafe to get access to compiler managed memory).

The second is as a general source of bugs. These collisions might in theory also happen if you are very unlucky. However, I doubt that you'd get "by chance" collisions given the 128 bits.

So generally I don't think the hash is correctness-critical.

@michaelwoerister
Copy link
Member Author

One is security related where e.g. someone creates some code with a given hash and it now gets put into a different function as well that has the same hash.

It's hard to imagine how that would work in practice. A potential attacker could try to inject some code that has the same hash as some other piece of code that is already in your codebase. But the effect of the hash collision would be that the injected piece of code would not get used since the difference between the two pieces of code is not detected :)

@nagisa
Copy link
Member

nagisa commented Oct 5, 2017 via email

@michaelwoerister
Copy link
Member Author

@nagisa I don't know of a use case where it makes sense to move an incr. comp. cache from one machine to another. I personally would prefer to have just one implementation to maintain but if a hardware accelerated SHA256 was significantly faster than SipHash, then in theory it could be done. Do you have numbers on how fast SHA256 can get?

@nagisa
Copy link
Member

nagisa commented Oct 5, 2017

@michaelwoerister I posted #41215 (comment) earlier in the thread. Comparison is against Fletcher4, which is not very useful here, so I ran a similar test using std::hash::SipHasher24 in an environment as close to what I remember I used for the initial test. SipHasher manages aproximately a 2.33GB/s, so SHA doesn’t seem to be faster.

@arthurprs
Copy link
Contributor

arthurprs commented Oct 5, 2017

I think 128bit SipHash is the way to go here, there are surely faster hashers but the returns are very small for this workload (MetroHash results sort of prove that). No amount of bits can guarantee no collisions 🤷‍♀️ and as far as I understand there's no need for crypto hash security anyway.

Also, I don't fell comfortable using MetroHash or (insert your fav. fast hash here) in places other than hashtables.

@arielb1
Copy link
Contributor

arielb1 commented Oct 5, 2017

@michaelwoerister

My worry was more that a specially-crafted piece of "code" (e.g. string data included from some outside source) could cause MetroHash to get into a bad state in which "particular" collisions are "easy". I can't think of a simple way to extend this into an exploit, but stranger things have happened.

I think that SipHash with a per-incremental-directory random key should be, if not something I'm 100% comfortable with, harder to exploit - exploiting it would require finding the SipHash key (which is not that easy) and crafting data based on that. While I can imagine some online situations in which that could be exploitable, this decreases the attack surface.

@arielb1
Copy link
Contributor

arielb1 commented Oct 5, 2017

From an academic cryptographer POV, I can hardly call this scheme cryptographically secure. I would consider it a nice opportunity to try and find a contrived way to "break this" if you control all the neighboring degrees of freedom.

Actually, all 128-bit schemes expose an attack with complexity about 2^64 - if there's a "data blob" an attacker can manipulate, he can "birthday attack" his way to a collision and cause some specific target to not be updated. I weakly suspect that sip2-4 has known-key collision attacks with better complexity (it wasn't that well analyzed for collision resistance). 2^64 is already "way too close to be comfortable" (e.g. the bitcoin network computes about that amount of hashes every second), especially 10 years from now.

From a practical standpoint, I think keeping the SIP key reasonably secret would make "generic" attacks more complicated while I still can't see a good way to get a "theoretical" level of security.

@joshtriplett
Copy link
Member

joshtriplett commented Oct 5, 2017

Would seahash make sense here? It's quite fast, and collision-resistant.

I'm curious about the microbenchmarks used to evaluate it relative to metrohash, given that seahash's own benchmarks seem to disagree with that result. Might be worth doublechecking that with the seahash and metrohash developers, both.

@michaelwoerister
Copy link
Member Author

From an academic cryptographer POV, I can hardly call this scheme cryptographically secure.

That's what I'm saying all along: None of this is supposed to be cryptographically secure. Using SipHash or even BLAKE2 would not change that.

@michaelwoerister
Copy link
Member Author

@joshtriplett This is the benchmark I used:

#![allow(non_snake_case)]
#![feature(test)]

extern crate test;
extern crate rand;

extern crate metrohash;
extern crate fasthash;

use test;
use rand::{self, Rng};
use metrohash::MetroHash128;
use fasthash::sea::SeaHasher64 as fasthash_SeaHasher64;

fn create_unique_values(count: usize) -> Vec<u8> {
    let mut rng = rand::thread_rng();
    let mut values = Vec::with_capacity(count);
    for _ in 0 .. count {
        values.push(rng.gen())
    }

    values
}

fn run_bench<H: Hasher+Default>(bytes: usize, bh: &mut test::Bencher) {
    let values = create_unique_values(bytes);
    bh.iter(|| {
        let mut hasher = H::default();
        values.hash(&mut hasher);
        test::black_box(hasher.finish());
    })
}

// This is roughly the average amount of data that we hash at a time in the Rust compiler  
const BYTE_COUNT: usize = 400;

#[bench]
fn metrohash128(bh: &mut test::Bencher) {
    run_bench::<MetroHash128>(BYTE_COUNT, bh);
}

#[bench]
fn fasthash_SeaHasher64(bh: &mut test::Bencher) {
    run_bench::<fasthash_SeaHasher64>(BYTE_COUNT, bh);
}

I get these results:

test bench::fasthash_SeaHasher64        ... bench:          67 ns/iter (+/- 3)
test bench::metrohash128                ... bench:          58 ns/iter (+/- 2)

TBH, I would be less comfortable with using SeaHash than with using MetroHash since SeaHash is even newer. There's no doubt that SeaHash is a great hash function but for this use case I prefer something more boring :)

@michaelwoerister
Copy link
Member Author

@nagisa Thanks for those numbers! Looks like SipHash is the better choice then.

bors added a commit that referenced this issue Oct 19, 2017
…sakis

incr.comp.: Use 128bit SipHash for fingerprinting

This PR switches incr. comp. result fingerprinting from 128 bit BLAKE2 to 128 bit SipHash. When we started using BLAKE2 for fingerprinting, the 128 bit version of SipHash was still experimental. Now that it isn't anymore we should be able to get a nice performance boost without significantly increasing collision probability.

~~I'm going to start a try-build for this, so we can gauge the performance impact before merging (hence the `WIP` in the title).~~

EDIT: Performance improvements look as expected. Tests seem to be passing.

Fixes #41215.
bors added a commit that referenced this issue Oct 20, 2017
…sakis

incr.comp.: Use 128bit SipHash for fingerprinting

This PR switches incr. comp. result fingerprinting from 128 bit BLAKE2 to 128 bit SipHash. When we started using BLAKE2 for fingerprinting, the 128 bit version of SipHash was still experimental. Now that it isn't anymore we should be able to get a nice performance boost without significantly increasing collision probability.

~~I'm going to start a try-build for this, so we can gauge the performance impact before merging (hence the `WIP` in the title).~~

EDIT: Performance improvements look as expected. Tests seem to be passing.

Fixes #41215.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-incr-comp Area: Incremental compilation C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC
Projects
None yet
Development

No branches or pull requests

10 participants