incorrect hash impl #72

conradludgate · 2024-12-21T07:03:22Z

https://doc.rust-lang.org/std/hash/trait.Hash.html#hash-and-eq

When implementing both Hash and Eq, it is important that the following property holds:
k1 == k2 -> hash(k1) == hash(k2)
In other words, if two keys are equal, their hashes must also be equal

UniCase<S: AsRef<str>> implements Hash + Eq, so the above requirement must be satisfied. The following test case fails:

use std::hash::BuildHasher;
use foldhash::quality::RandomState;
use unicase::UniCase;

let k1 = UniCase::unicode("Foo");
let k2 = UniCase::new("foO");

assert_eq!(k1, k2);

let h = RandomState::default();
assert_eq!(h.hash_one(k1), h.hash_one(k2));

This occurs because UniCase::unicode produces b"f", b"o", b"o", while UniCase::new (with ascii) produces b'f', b'o', b'o', which is not guaranteed to be the same under Hasher.

The text was updated successfully, but these errors were encountered:

conradludgate · 2024-12-21T07:06:26Z

I discovered this while trying to move phf to use foldhash instead of siphash. The siphasher crate happened to fold write/write_u8 calls, while foldhash does not, which caused unexpected test failures.

rust-phf/rust-phf#321

seanmonstar · 2024-12-23T16:09:38Z

Hm, is that actually Unicode that is wrong? Is it not FoldHash that might be doing it wrong, if write_u8 four times is different than write(four_bytes)?

We could unroll the call here in this crate, but it does seem surprising that that's considered different in terms of hash output.

conradludgate · 2024-12-23T17:15:38Z

https://doc.rust-lang.org/std/hash/trait.Hasher.html

This trait provides no guarantees about how the various write_* methods are defined and implementations of Hash should not assume that they work one way or another. You cannot assume, for example, that a write_u32 call is equivalent to four calls of write_u8. Nor can you assume that adjacent write calls are merged

conradludgate · 2024-12-23T18:02:39Z

We could unroll the call here in this crate, but it does seem surprising that that's considered different in terms of hash output.

I was thinking the other way around, making Ascii turn write_u8(b) into write(std::slice::from_ref(&b)) but I don't know for certain if that is sufficient.

orlp · 2024-12-23T19:03:07Z

The fastest compatible implementation for both would be to turn the strings into 4-byte codepoints and hashing the stream of codepoints using write_u32. It's still a bit sad in both cases though.

seanmonstar · 2024-12-23T19:29:44Z

That sounds undesirable to me, because many characters are just 1 byte, perhaps 2, but that would require hashing 4 for every character. I'm thinking of just adding a loop on the bytes instead.

seanmonstar mentioned this issue Dec 23, 2024

fix: hash for Unicode will call write_u8 like Ascii does #73

Merged

seanmonstar closed this as completed in #73 Dec 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incorrect hash impl #72

incorrect hash impl #72

conradludgate commented Dec 21, 2024

conradludgate commented Dec 21, 2024 •

edited

Loading

seanmonstar commented Dec 23, 2024

conradludgate commented Dec 23, 2024

conradludgate commented Dec 23, 2024

orlp commented Dec 23, 2024

seanmonstar commented Dec 23, 2024

incorrect hash impl #72

incorrect hash impl #72

Comments

conradludgate commented Dec 21, 2024

conradludgate commented Dec 21, 2024 • edited Loading

seanmonstar commented Dec 23, 2024

conradludgate commented Dec 23, 2024

conradludgate commented Dec 23, 2024

orlp commented Dec 23, 2024

seanmonstar commented Dec 23, 2024

conradludgate commented Dec 21, 2024 •

edited

Loading