Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect hash impl #72

Closed
conradludgate opened this issue Dec 21, 2024 · 6 comments · Fixed by #73
Closed

incorrect hash impl #72

conradludgate opened this issue Dec 21, 2024 · 6 comments · Fixed by #73

Comments

@conradludgate
Copy link

https://doc.rust-lang.org/std/hash/trait.Hash.html#hash-and-eq

When implementing both Hash and Eq, it is important that the following property holds:
k1 == k2 -> hash(k1) == hash(k2)
In other words, if two keys are equal, their hashes must also be equal

UniCase<S: AsRef<str>> implements Hash + Eq, so the above requirement must be satisfied. The following test case fails:

use std::hash::BuildHasher;
use foldhash::quality::RandomState;
use unicase::UniCase;

let k1 = UniCase::unicode("Foo");
let k2 = UniCase::new("foO");

assert_eq!(k1, k2);

let h = RandomState::default();
assert_eq!(h.hash_one(k1), h.hash_one(k2));

This occurs because UniCase::unicode produces b"f", b"o", b"o", while UniCase::new (with ascii) produces b'f', b'o', b'o', which is not guaranteed to be the same under Hasher.

@conradludgate
Copy link
Author

conradludgate commented Dec 21, 2024

I discovered this while trying to move phf to use foldhash instead of siphash. The siphasher crate happened to fold write/write_u8 calls, while foldhash does not, which caused unexpected test failures.

rust-phf/rust-phf#321

@seanmonstar
Copy link
Owner

Hm, is that actually Unicode that is wrong? Is it not FoldHash that might be doing it wrong, if write_u8 four times is different than write(four_bytes)?

We could unroll the call here in this crate, but it does seem surprising that that's considered different in terms of hash output.

@conradludgate
Copy link
Author

https://doc.rust-lang.org/std/hash/trait.Hasher.html

This trait provides no guarantees about how the various write_* methods are defined and implementations of Hash should not assume that they work one way or another. You cannot assume, for example, that a write_u32 call is equivalent to four calls of write_u8. Nor can you assume that adjacent write calls are merged

@conradludgate
Copy link
Author

We could unroll the call here in this crate, but it does seem surprising that that's considered different in terms of hash output.

I was thinking the other way around, making Ascii turn write_u8(b) into write(std::slice::from_ref(&b)) but I don't know for certain if that is sufficient.

@orlp
Copy link

orlp commented Dec 23, 2024

The fastest compatible implementation for both would be to turn the strings into 4-byte codepoints and hashing the stream of codepoints using write_u32. It's still a bit sad in both cases though.

@seanmonstar
Copy link
Owner

That sounds undesirable to me, because many characters are just 1 byte, perhaps 2, but that would require hashing 4 for every character. I'm thinking of just adding a loop on the bytes instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants