bpo-34751: improved hash function for tuples #9471
Conversation
Include/pyhash.h (outdated)
#define _PyHASH_MULTIPLIER ((Py_uhash_t)1000003UL)

/* Official constants from the xxHash specification */
#if SIZEOF_VOID_P > 4
Despite at least one confused comment in the current code, the size of a hash code has no logical relationship to the size of a pointer. We're doing arithmetic on ints of type Py_uhash_t, so the obvious thing to test instead is SIZEOF_PY_UHASH_T.
Sure, I didn't know that SIZEOF_PY_UHASH_T existed.
By the way, I used SIZEOF_VOID_P because I copied that from here:
#if SIZEOF_VOID_P >= 8
# define _PyHASH_BITS 61
#else
# define _PyHASH_BITS 31
#endif
Should I also change that to use SIZEOF_PY_UHASH_T?
That certainly shouldn't be changed in this patch. If it's a problem, it should be addressed in a different patch. I don't know whether it is or isn't confused - determining that would require staring at the code using the weirdly named _PyHASH_BITS (weirdly named because it obviously isn't intended to be the number of bits in a Python hash code - else the constants would be 64 instead of 61, and 32 instead of 31).
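For concreteness, a test on SIZEOF_PY_UHASH_T could look something like the sketch below. The _PyHASH_XXPRIME_* names are the ones used later in the patch; the values are the official 32-bit and 64-bit primes from the xxHash specification.
/* Sketch: pick constants by the width of Py_uhash_t itself,
   not by the width of a pointer. */
#if SIZEOF_PY_UHASH_T > 4
#define _PyHASH_XXPRIME_1 ((Py_uhash_t)11400714785074694791ULL)
#define _PyHASH_XXPRIME_2 ((Py_uhash_t)14029467366897019727ULL)
#define _PyHASH_XXPRIME_5 ((Py_uhash_t)2870177450012600261ULL)
#else
#define _PyHASH_XXPRIME_1 ((Py_uhash_t)2654435761UL)
#define _PyHASH_XXPRIME_2 ((Py_uhash_t)2246822519UL)
#define _PyHASH_XXPRIME_5 ((Py_uhash_t)374761393UL)
#endif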
Include/pyhash.h (outdated)
/* Prime multiplier used in string and various other hashes. */
#define _PyHASH_MULTIPLIER 1000003UL /* 0xf4243 */
/* Constant used for various hashes */
#define _PyHASH_MULTIPLIER ((Py_uhash_t)1000003UL)
Why change this? The tuple hash no longer uses it at all, so there's no need to change it. I have no idea whether casting it to Py_uhash_t makes sense in places that still do use it, and don't want to bother searching for them. "Not broke, don't fix."
Sure. I guess this was just a left-over from earlier changes.
Objects/tupleobject.c (outdated)
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
Py_uhash_t ERR = -1;
This is akin to, e.g., `int TWO = 2;` 😉
-1 is hardcoded all over almost all API functions that return integers to mean "error". Giving it a symbolic name instead suggests this function uses some other convention - which it doesn't.
It's mainly to avoid the cast: ERR is a lot shorter to write than (Py_uhash_t)-1. Of course, this is a pure bikeshedding issue so I'll happily write (Py_uhash_t)-1 if you insist.
Code is read far more often than it's written. (Py_uhash_t)-1 is wholly explicit, obvious, and self-contained.
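For reference, the explicit spellings being argued for here look roughly like the fragments below, echoing the diffs quoted in this thread (no symbolic ERR name anywhere):
/* error propagation from an item's hash, using the bare -1 convention */
if (y == -1)
    return -1;
/* post-loop check, with the cast spelled out where a Py_uhash_t compare is needed */
if (acc == (Py_uhash_t)-1)
    return -2;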
Objects/tupleobject.c (outdated)
acc += lane * _PyHASH_XXPRIME_2;
/* Rotate acc left by 13 or 31 bits */
acc = (acc << (sizeof(acc) <= 4 ? 13 : 31)) +
      (acc >> (sizeof(acc) <= 4 ? 19 : 33));
Here I think it would be better to #define a "rotate" macro at the place that already does different things depending on the size of Py_uhash_t, with constants hard-coded appropriate for the size in use. The simpler the code, the more likely a compiler will emit a single rotate instruction (e.g., I checked that Visual Studio compiled my spelling to a single rotate in the code I posted on BPO).
Turns out Visual Studio does not generate a rotate instruction for this, but for a different reason: using "+" to combine the pieces. Compilers do dumb pattern matching for stuff like this, and "everyone" writes a rotate by hand using "|" instead. Use "|" instead of "+", and it does generate a single rotate instruction.
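Putting both points together, such a macro might look like this - a sketch only, using the _PyHASH_XXROTATE name that comes up later in the patch, the 13/19 and 31/33 shift pairs from the code above, and "|" so compilers recognize the rotate idiom:
#if SIZEOF_PY_UHASH_T > 4
#define _PyHASH_XXROTATE(x) ((x << 31) | (x >> 33))  /* rotate left 31 bits */
#else
#define _PyHASH_XXROTATE(x) ((x << 13) | (x >> 19))  /* rotate left 13 bits */
#endif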
@@ -0,0 +1,3 @@
The hash function for tuples is now based on xxHash.
This makes hash collisions less likely.
Suggest instead "This makes a high rate of collisions much less likely in certain rare cases on all platforms, and on 64-bit platforms improves tuple hashes in general (the old algorithm was designed for 32-bit hash codes)."
I don't see why you think that it's a 32-bit vs. 64-bit issue. The statement "the old algorithm was designed for 32-bit hash codes" is probably true. However, that's not what is causing the problems here.
Maybe I could mention that we now have different algorithms for 32-bit and 64-bit systems; that might be relevant.
As you pointed out on BPO, the multiplier was "too small" for 64-bit boxes before. For that reason, the new algorithm (which doesn't have that issue) can be expected to do better on 64-bit boxes in general, quite apart from the handful of very bad special all-platform cases we identified. That's potentially of real interest to users. That the new algorithm is merely somewhat different between its own 32- and 64-bit versions is an implementation detail of no interest to the vast bulk of users.
Branch updated, please review...
Include/pyhash.h (outdated)
/* Multiplier used for various hashes */
#define _PyHASH_MULTIPLIER 1000003UL

/* Official constants from the xxHash specification */
Suggest adding:
/* Optimizing compilers should emit a single "rotate" instruction for the
 * _PyHASH_XXROTATE expansion. If that doesn't happen for some important
 * platform, the macro could be changed to expand to a platform-specific
 * rotate spelling instead.
 */
Can we try to finish this please? We spent a lot of time discussing the implementation and I think that we agree now. If there are any further changes which should be made here, just let me know.
I would like to finish this too. I've been waiting for Raymond to chime in - but if he won't, we still need some other core dev to step in. While I have "the commit bit", I've never used it, and have the most minimal understanding of how the GitHub workflow is intended to work. That's why I kept posting code on the BPO report instead 😉
Please move the pyhash.h code into tupleobject.c so that this hash is self-contained and doesn't leak out of the tuple object code.
non-cryptographic hash:
- we do not use any parallelism, there is only 1 accumulator.
- we drop the final mixing since this is just a permutation of the
  output space: it does not help against collisions.
I'm not comfortable dropping the final permutation.
"Final mixing" here refers to the xxHash spec's post-loop avalanche code, which is one long serialized critical path spanning 8 instructions including 2 multiplies. I don't believe that's what you (Raymond) have in mind here at all.
If you're talking about merely adding a large constant, ya, that can't hurt (except to add a cycle to the critical path), but xxHash doesn't do it, and I'm at a loss to dream up "a reason" for why it might do any good.
One of my goals here is to make it as brainless as possible to replace this code with the guts of xxHash version 2, if and when someone in real life finds a horrible case that's added to the SMHasher test suite. So I would really like to see a "good reason" to deviate at all from the spec.
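For reference, the post-loop avalanche in the xxHash64 spec is roughly the sequence below (written in terms of the same accumulator, assumed 64 bits wide) - 3 shifts, 3 xors and 2 multiplies, all serialized; the PRIME64_* values are the spec's own constants:
acc ^= acc >> 33;
acc *= 14029467366897019727ULL;  /* PRIME64_2 */
acc ^= acc >> 29;
acc *= 1609587929392839161ULL;   /* PRIME64_3 */
acc ^= acc >> 32;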
For the final permutation, a single addend would suffice.
If we can't identify the purpose of doing this, how could we know whether it suffices? Best I can tell, you believe it helped with nested tuples in some other function, but one of radically different structure. In this function, the result of any nested tuple's hash is immediately multiplied by _PyHASH_XXPRIME_2 inside the loop, which is a far more disruptive permutation than adding a constant.
What it does as-is is powerful enough that we don't have any hint of a problem in either of the intensely "nested tuple" tests we have now. I'd be astonished if adding a constant hurt the test results - but even more astonished if it helped. I really don't want to add a line of code for which the only comment I could write is:
/* This isn't part of the xxHash spec, and removing it has
* no effect on any test we have, nor can we identify a
* reason for why it's here. DO NOT CHANGE!
*/
😉
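For context, the per-item step looks roughly like the sketch below, pieced together from the fragments quoted in this review (acc is the running accumulator, item/len the tuple's contents and length; the rotate-then-multiply tail follows the xxHash round structure, so treat this as a sketch rather than the exact patch text):
for (i = 0; i < len; i++) {
    Py_uhash_t lane = (Py_uhash_t)PyObject_Hash(item[i]);
    if (lane == (Py_uhash_t)-1)
        return -1;                     /* propagate the item's hash error */
    acc += lane * _PyHASH_XXPRIME_2;   /* fold in the item's hash */
    acc = _PyHASH_XXROTATE(acc);       /* rotate left 13 or 31 bits */
    acc *= _PyHASH_XXPRIME_1;          /* one xxHash "round" per item */
}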
For reasons I can't fully defend, I think the final addend is important and I'm not comfortable dropping that final step out of the tuplehash. The original xxhash has a complex permutation step, but I think an addend will suffice to amplify effects between levels of nesting. It costs one clock, is possibly beneficial, and is utterly harmless. Unless you're dead set against it, let's not battle over this one.
I can't find it again, but I left a comment somewhere saying that making that change (predictably) made no significant difference in any of the tests. Just seemingly randomly made insignificant differences in the number of collisions in about 20% of the tests. So keeping it in is OK by me, but you come up with a comment to explain it ;-)
Explained before why xxHash does what it does: it's striving for avalanche perfection, and there aren't enough permutations inside the loop for each bit of the last input or two to have a decent chance of affecting all the other bits in the full-width hash code. So they pile up more permutations outside the loop, specifically designed for their avalanche properties.
But they're permutations: two full-width hash codes collide after their avalanche code if and only if they collided before their avalanche code. We do care about collisions, but don't care about avalanche perfection, so their avalanche code serves no purpose we care about.
Merely adding a constant would have done nothing to help xxHash meet its avalanche goals; it's quite clear to me why their post-loop code is as long-winded as it is.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again."
This is maddening - I make comments, and then they vanish, and then later they show up again - and then 6 copies show up - and then they vanish again. Similarly for Raymond's comments. I'm giving up for tonight.
Could this be related to https://blog.github.com/2018-10-21-october21-incident-report/ ?
I have made the requested changes; please review again.
Thanks for making the requested changes! @rhettinger: please review the changes made to this pull request.
Objects/tupleobject.c (outdated)
acc += len ^ (_PyHASH_XXPRIME_5 ^ 3527539);

if (acc == (Py_uhash_t)-1) {
    return -2;
While we're at it, we could return a more interesting value than -2, something random and large.
I don't see how returning a different value than -2 would help anything. If the hash function is sufficiently random (which we assume it is), then both the values -1 and -2 should be very rare. So replacing -1 by -2 increases collisions by a very very tiny amount. But replacing -1 by, say, 1546275796, would increase the number of collisions by the same very very tiny amount.
And then I have to invent yet another random constant which I want to avoid if possible. It would just be additional complication without any gain.
If you insist and come up with a concrete proposal for the random value, I'll make the change anyway.
If the hash function is sufficiently random (which we assume it is)
But we don't. In particular, we know for sure that hash(-1) == hash(-2) == -2, and hashes of little integers are common as dirt. While I doubt we'd ever be able to measure the difference, on the face of it it's attractive not to systematically create even more cases that always return -2.
If you insist and come up with a concrete proposal for the random value, I'll make the change anyway.
You already did! 😄 1546275796 is fine (far less likely than -2 to show up as a hash, and fits in 31 bits).
In particular, we know for sure that hash(-1) == hash(-2) == -2, and hashes of little integers are common as dirt. While I doubt we'd ever be able to measure the difference, on the face of it it's attractive not to systematically create even more cases that always return -2.
OK, I see your point: it's not about avoiding collisions between hashes of tuples, it's about avoiding collisions between the hash of a tuple and the hash of a non-tuple (like the integer -2).
OK, done
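So the post-loop check presumably ends up looking something like this (a sketch of the change agreed to above):
if (acc == (Py_uhash_t)-1) {
    return 1546275796;   /* must not return -1 (the error sentinel); -2 would
                            needlessly collide with hash(-1) and hash(-2) */
}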
It looks like we're getting close. Please move pyhash.h code into tupleobject.c.
Done!
Overall, it looks good. Will double check it again later (I'm task saturated at the moment) and then apply. Nice work.
This patch improves the hash function for tuples to avoid structural hash collisions.
The new hash function is a simplified variant of xxHash.
https://bugs.python.org/issue34751