-
Notifications
You must be signed in to change notification settings - Fork 781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple collisions in XXH3 #180
Comments
Knowing the default keys, it is expected to be trivial to reverse engineer and generate multiplications by 0. Nonetheless, for such case, the long term plan is to allow insertion of custom keys. Finally, suggestions for algorithm improvements are highly welcomed. It's the purpose of the test period to gather feedback and use them to improve the characteristics of the hash. |
What is the seed supposed to do then? Usually the idea of a seeded hash function is that you can provide some or all of the guarantees of a cryptographic hash function, provided that you keep the seed secret. But at the very least changing seed should be a way of rerolling collisions. As for contributing, the way I see it, the world doesn't need more hash functions, unless you can make a function that improve respectably upon the current best offerings you might as well not bother. I don't see a way of fixing this hash without severely reducing the speed, and what is the point then? |
^ Sure, you can change the magic numbers used for the hash, but that will lead to its own issues and you are basically rewriting the hash function. I had to rewrite my NEON implementation for this same reason, so I think fixing this isn't too big of a deal. |
int main()
{
char testbuffer[9] = { 0 };
long long int now = time(NULL);
*(unsigned int*)testbuffer = -0xb8fe6c39;
uint64_t lasthash = XXH3_64bits_withSeed(testbuffer, 8, now);
printf("%llu\n", lasthash);
long long i = 0;
for (int a = 0; a < 256; a++) {
testbuffer[4] = a;
for (int b = 0; b < 256; b++) {
testbuffer[5] = b;
for (int c = 0; c < 256; c++) {
testbuffer[6] = c;
for (int d = 0; d < 256; d++) {
testbuffer[7] = d;
for (int e = 0; e < 256; e++) {
testbuffer[8] = e;
if (lasthash != XXH3_64bits_withSeed(testbuffer, 8, now))
abort();
i++;
}
}
}
}
}
printf("%lli collisions\n", i);
}
… |
I guess an opened question is : should a non-cryptographic hash offer a guarantee of zero seed-independent collision ? I thought the definition of non-cryptographic implies there is no such guarantee. As mentioned by @easyaspi314 , it's simply a common (and therefore assumed) situation for non-cryptographic hashes. The seed so far is only intended to change the outcome of the hash function. This is useful when a user wants to quickly append a short piece of information (say, an identifier) to a content, and generate a hash value for the combination of both. An alternative is to append the identifier to the input, but it's not always possible (input is typically read only). Another solution is to pass both parts successively to a streaming API, but it loses a ton of performance in the process. Offering a guarantee of no seed-independent collision is a much higher level of protection than non-cryptographic. If it's a question of naming, I'm fine with swapping names around : |
Nah, that is a pretty basic expectancy. int main() {
uint64_t testbuffer[2]= {0};
int64_t now = time(NULL);
testbuffer[0] = -0x23a44bbeb8fe6c39;
uint64_t lasthash = XXH3_64bits_withSeed(testbuffer, 16, now);
uint64_t collisions = 0;
uint64_t i = ULLONG_MAX;
while (i-- > 0) {
testbuffer[1] = i;
uint64_t newhash = XXH3_64bits_withSeed(testbuffer, 16, now);
if (newhash != lasthash)
abort();
collisions++;
}
printf("%llu collisions\n", collisions);
} I can get
Back to the drawing board, definitely. Actually, it is so bad that Clang actually constexpr's the entire thing (GCC can only do it when the 128-bit multiply is replaced with the scalar version): main: ## @main
## %bb.0:
push rax
xor edi, edi
call time
lea rdi, [rip + L_.str]
mov rsi, -1
xor eax, eax
call printf
xor eax, eax
pop rcx
ret |
(I'm using my "clean reference" implementation I was about to upload here to make the algorithm clearer) Wouldn't this be better? static XXH64_hash_t
XXH3_len_4to8_64b(uint8_t const* const data, size_t const len, uint32_t const* const key, XXH64_hash_t const seed)
{
uint64_t acc = PRIME64_1 * ((uint64_t) len + seed);
uint32_t const dataL = XXH_read32(data);
uint32_t const dataH = XXH_read32(data + len - 4);
uint32_t const l1 = dataL + key[0];
uint32_t const l2 = dataH + key[1];
acc += (uint64_t) dataL | ((uint64_t) dataH << 32);
acc += (uint64_t) l1 * (uint64_t) l2;
return XXH3_avalanche(acc);
}
static XXH64_hash_t
XXH3_len_9to16_64b(uint8_t const* const data, size_t const len, uint32_t const* const key, XXH64_hash_t const seed)
{
uint64_t acc = PRIME64_1 * ((uint64_t) len + seed);
uint64_t const dataL = XXH_read64(data);
uint64_t const dataH = XXH_read64(data + len - 8);
uint64_t const ll1 = dataL + XXH3_readKey64(key);
uint64_t const ll2 = dataH + XXH3_readKey64(key + 2);
acc += dataL;
acc += dataH;
acc += XXH3_foldedMult128(ll1, ll2);
return XXH3_avalanche(acc);
} It's basically the same thing as |
I've no problem re-using the formula for long inputs onto short inputs too. I'm more concerned by the kind of confusion of objectives that I discern in this thread. In the above comments, I'm actually more concerned by the reduction in space created by a zero multiplication. Even then, this reduction is only detrimental in the 4-8 bytes range for 64-bit hash, and in the 9-16 bytes range for the 128-bit hash. After that point, a reduction is necessary by definition, so starting the reduction early or later doesn't change the outcome. That being said, in adopting the formula of long inputs for shorter ones, I could as well generalize it for any input length >= 4, just as a matter of consistency. Only the 1-3 range will be different, though I don't think it matters much (this one can't multiply by zero to begin with). |
Right, obviously, a non-cryptographic hash isn't intended to be perfect, but it should attempt to be decent. If more than That is why Ruby had to switch hash tables, because give the Hash that easily crafted data and boom, denial of service. It's a good thing we caught it now, tbh. |
In branch
The speed seems slightly slower in the 4-16 length range, but this is within my benchmark noise level, so it's pretty small. I'll need more measurements to ensure accuracy. I'm concentrating on the |
You still got 0 multiplication happening, and even without zeroes there is also a lot of different number pairs that produce the same result, the way you use multiplications is fundamentally unsound.
That likely just changes the number you have to aim for in order to produce the same effect, if you for instance do:
You can take |
This is not what happens in this case. |
In branch @easyaspi314 : on the NEON code path, some changes are needed.
These changes are pretty small in term of nb of lines, and insensitive for performance. |
I just checked, and your old hash function is also susceptible to seed-independent collisions, here is for instance 100 almost-guaranteed collisions:
|
This was already known. |
There are numerous issues all over the code.
Mainly it relies on multiplication between non-constant numbers, which in general is a bad idea, particularly it leads to spots of ignored input data when one of those numbers is 0, but it also just throws away data way too early.
The idea of using unrelated 64 bit accumulators does not fly for a 128 bit hash, if only one of the accumulators has its input changed random collisions will occur as often as with a 64 bit hash.
The seed is in most cases not applied properly, so it is easy to fabricate generic collisions that work regardless of the seed. It also does nothing to protect the input data. Seed and input is easily recoverable from hashes on small input.
Here is an example of collisions because of multiplication by 0 (code is untested because the build is broken):
The text was updated successfully, but these errors were encountered: