bpo-34751: improved hash function for tuples #9471
Conversation
Include/pyhash.h (outdated)
#define _PyHASH_MULTIPLIER ((Py_uhash_t)1000003UL)

/* Official constants from the xxHash specification */
#if SIZEOF_VOID_P > 4
Despite at least one confused comment in the current code, the size of a hash code has no logical relationship to the size of a pointer. We're doing arithmetic on ints of type Py_uhash_t, so the obvious thing to test instead is SIZEOF_PY_UHASH_T.
Sure, I didn't know that SIZEOF_PY_UHASH_T existed.
By the way, I used SIZEOF_VOID_P because I copied that from here:
#if SIZEOF_VOID_P >= 8
# define _PyHASH_BITS 61
#else
# define _PyHASH_BITS 31
#endif
Should I also change that to use SIZEOF_PY_UHASH_T?
That certainly shouldn't be changed in this patch. If it's a problem, it should be addressed in a different patch. I don't know whether it is or isn't confused - determining that would require staring at the code using the weirdly named _PyHASH_BITS (weirdly named because it obviously isn't intended to be the number of bits in a Python hash code - else the constants would be 64 instead of 61, and 32 instead of 31).
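For concreteness, a test on SIZEOF_PY_UHASH_T could look something like the sketch below. The _PyHASH_XXPRIME_* names are the ones used later in the patch; the values are the official 32-bit and 64-bit primes from the xxHash specification.
/* Sketch: pick constants by the width of Py_uhash_t itself,
   not by the width of a pointer. */
#if SIZEOF_PY_UHASH_T > 4
#define _PyHASH_XXPRIME_1 ((Py_uhash_t)11400714785074694791ULL)
#define _PyHASH_XXPRIME_2 ((Py_uhash_t)14029467366897019727ULL)
#define _PyHASH_XXPRIME_5 ((Py_uhash_t)2870177450012600261ULL)
#else
#define _PyHASH_XXPRIME_1 ((Py_uhash_t)2654435761UL)
#define _PyHASH_XXPRIME_2 ((Py_uhash_t)2246822519UL)
#define _PyHASH_XXPRIME_5 ((Py_uhash_t)374761393UL)
#endif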
Include/pyhash.h (outdated)
/* Prime multiplier used in string and various other hashes. */
#define _PyHASH_MULTIPLIER 1000003UL /* 0xf4243 */
/* Constant used for various hashes */
#define _PyHASH_MULTIPLIER ((Py_uhash_t)1000003UL)
Why change this? The tuple hash no longer uses it at all, so there's no need to change it. I have no idea whether casting it to Py_uhash_t makes sense in places that still do use it, and don't want to bother searching for them. "Not broke, don't fix."
Sure. I guess this was just a left-over from earlier changes.
Objects/tupleobject.c (outdated)
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
Py_uhash_t ERR = -1;
This is akin to, e.g., `int TWO = 2;` 😉
-1 is hardcoded all over almost all API functions that return integers to mean "error". Giving it a symbolic name instead suggests this function uses some other convention - which it doesn't.
It's mainly to avoid the cast: ERR is a lot shorter to write than (Py_uhash_t)-1. Of course, this is a pure bikeshedding issue so I'll happily write (Py_uhash_t)-1 if you insist.
Code is read far more often than it's written. (Py_uhash_t)-1 is wholly explicit, obvious, and self-contained.
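For reference, the explicit spellings being argued for here look roughly like the fragments below, echoing the diffs quoted in this thread (no symbolic ERR name anywhere):
/* error propagation from an item's hash, using the bare -1 convention */
if (y == -1)
    return -1;
/* post-loop check, with the cast spelled out where a Py_uhash_t compare is needed */
if (acc == (Py_uhash_t)-1)
    return -2;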
Objects/tupleobject.c (outdated)
acc += lane * _PyHASH_XXPRIME_2;
/* Rotate acc left by 13 or 31 bits */
acc = (acc << (sizeof(acc) <= 4 ? 13 : 31)) +
      (acc >> (sizeof(acc) <= 4 ? 19 : 33));
Here I think it would be better to #define a "rotate" macro at the place that already does different things depending on the size of Py_uhash_t, with constants hard-coded appropriate for the size in use. The simpler the code, the more likely a compiler will emit a single rotate instruction (e.g., I checked that Visual Studio compiled my spelling to a single rotate in the code I posted on BPO).
Turns out Visual Studio does not generate a rotate instruction for this, but for a different reason: using "+" to combine the pieces. Compilers do dumb pattern matching for stuff like this, and "everyone" writes a rotate by hand using "|" instead. Use "|" instead of "+", and it does generate a single rotate instruction.
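Putting both points together, such a macro might look like this - a sketch only, using the _PyHASH_XXROTATE name that comes up later in the patch, the 13/19 and 31/33 shift pairs from the code above, and "|" so compilers recognize the rotate idiom:
#if SIZEOF_PY_UHASH_T > 4
#define _PyHASH_XXROTATE(x) ((x << 31) | (x >> 33))  /* rotate left 31 bits */
#else
#define _PyHASH_XXROTATE(x) ((x << 13) | (x >> 19))  /* rotate left 13 bits */
#endif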
@@ -0,0 +1,3 @@
The hash function for tuples is now based on xxHash.
This makes hash collisions less likely.
Suggest instead "This makes a high rate of collisions much less likely in certain rare cases on all platforms, and on 64-bit platforms improves tuple hashes in general (the old algorithm was designed for 32-bit hash codes)."
I don't see why you think that it's a 32-bit vs. 64-bit issue. The statement "the old algorithm was designed for 32-bit hash codes" is probably true. However, that's not what is causing the problems here.
Maybe I could mention that we now have different algorithms for 32-bit and 64-bit systems; that might be relevant.
As you pointed out on BPO, the multiplier was "too small" for 64-bit boxes before. For that reason, the new algorithm (which doesn't have that issue) can be expected to do better on 64-bit boxes in general, quite apart from the handful of very bad special all-platform cases we identified. That's potentially of real interest to users. That the new algorithm is merely somewhat different between its own 32- and 64-bit versions is an implementation detail of no interest to the vast bulk of users.
Branch updated, please review...
Include/pyhash.h (outdated)
/* Multiplier used for various hashes */
#define _PyHASH_MULTIPLIER 1000003UL

/* Official constants from the xxHash specification */
Suggest adding:
/* Optimizing compilers should emit a single "rotate" instruction for the
 * _PyHASH_XXROTATE expansion. If that doesn't happen for some important
 * platform, the macro could be changed to expand to a platform-specific
 * rotate spelling instead.
 */
Can we try to finish this please? We spent a lot of time discussing the implementation and I think that we agree now. If there are any further changes which should be made here, just let me know.
I would like to finish this too. I've been waiting for Raymond to chime in - but if he won't, we still need some other core dev to step in. While I have "the commit bit", I've never used it, and have the most minimal understanding of how the GitHub workflow is intended to work. That's why I kept posting code on the BPO report instead 😉
Please move the pyhash.h code into tupleobject.c so that this hash is self-contained and doesn't leak out of the tuple object code.
non-cryptographic hash:
- we do not use any parallelism, there is only 1 accumulator.
- we drop the final mixing since this is just a permutation of the
  output space: it does not help against collisions.
I'm not comfortable dropping the final permutation.
"Final mixing" here refers to the xxHash spec's post-loop avalanche code, which is one long serialized critical path spanning 8 instructions including 2 multiplies. I don't believe that's what you (Raymond) have in mind here at all.
If you're talking about merely adding a large constant, ya, that can't hurt (except to add a cycle to the critical path), but xxHash doesn't do it, and I'm at a loss to dream up "a reason" for why it might do any good.
One of my goals here is to make it as brainless as possible to replace this code with the guts of xxHash version 2, if and when someone in real life finds a horrible case that's added to the SMHasher test suite. So I would really like to see a "good reason" to deviate at all from the spec.
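For reference, the post-loop avalanche in the xxHash64 spec is roughly the sequence below (written in terms of the same accumulator, assumed 64 bits wide) - 3 shifts, 3 xors and 2 multiplies, all serialized; the PRIME64_* values are the spec's own constants:
acc ^= acc >> 33;
acc *= 14029467366897019727ULL;  /* PRIME64_2 */
acc ^= acc >> 29;
acc *= 1609587929392839161ULL;   /* PRIME64_3 */
acc ^= acc >> 32;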
For the final permutation, a single addend would suffice.
If we can't identify the purpose of doing this, how could we know whether it suffices? Best I can tell, you believe it helped with nested tuples in some other function, but one of radically different structure. In this function, the result of any nested tuple's hash is immediately multiplied by _PyHASH_XXPRIME_2 inside the loop, which is a far more disruptive permutation than adding a constant.
What it does as-is is powerful enough that we don't have any hint of a problem in either of the intensely "nested tuple" tests we have now. I'd be astonished if adding a constant hurt the test results - but even more astonished if it helped. I really don't want to add a line of code for which the only comment I could write is:
/* This isn't part of the xxHash spec, and removing it has
* no effect on any test we have, nor can we identify a
* reason for why it's here. DO NOT CHANGE!
*/
😉
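For context, the per-item step looks roughly like the sketch below, pieced together from the fragments quoted in this review (acc is the running accumulator, item/len the tuple's contents and length; the rotate-then-multiply tail follows the xxHash round structure, so treat this as a sketch rather than the exact patch text):
for (i = 0; i < len; i++) {
    Py_uhash_t lane = (Py_uhash_t)PyObject_Hash(item[i]);
    if (lane == (Py_uhash_t)-1)
        return -1;                     /* propagate the item's hash error */
    acc += lane * _PyHASH_XXPRIME_2;   /* fold in the item's hash */
    acc = _PyHASH_XXROTATE(acc);       /* rotate left 13 or 31 bits */
    acc *= _PyHASH_XXPRIME_1;          /* one xxHash "round" per item */
}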
For reasons I can't fully defend, I think the final addend is important and I'm not comfortable dropping that final step out of the tuplehash. The original xxhash has a complex permutation step, but I think an addend will suffice to amplify effects between levels of nesting. It costs one clock, is possibly beneficial, and is utterly harmless. Unless you're dead set against it, let's not battle over this one.
I can't find it again, but I left a comment somewhere saying that making that change (predictably) made no significant difference in any of the tests. Just seemingly randomly made insignificant differences in the number of collisions in about 20% of the tests. So keeping it in is OK by me, but you come up with a comment to explain it ;-)
Explained before why xxHash does what it does: it's striving for avalanche perfection, and there aren't enough permutations inside the loop for each bit of the last input or two to have a decent chance of affecting all the other bits in the full-width hash code. So they pile up more permutations outside the loop, specifically designed for their avalanche properties.
But they're permutations: two full-width hash codes collide after their avalanche code if and only if they collided before their avalanche code. We do care about collisions, but don't care about avalanche perfection, so their avalanche code serves no purpose we care about.
Merely adding a constant would have done nothing to help xxHash meet its avalanche goals; it's quite clear to me why their post-loop code is as long-winded as it is.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again."
This is maddening - I make comments, and then they vanish, and then later they show up again - and then 6 copies show up - and then they vanish again. Similarly for Raymond's comments. I'm giving up for tonight.
Could this be related to https://blog.github.com/2018-10-21-october21-incident-report/ ?
I have made the requested changes; please review again.
Thanks for making the requested changes! @rhettinger: please review the changes made to this pull request.
Objects/tupleobject.c (outdated)
acc += len ^ (_PyHASH_XXPRIME_5 ^ 3527539);

if (acc == (Py_uhash_t)-1) {
    return -2;
While we're at it, we could return a more interesting value than -2, something random and large.
I don't see how returning a different value than -2 would help anything. If the hash function is sufficiently random (which we assume it is), then both the values -1 and -2 should be very rare. So replacing -1 by -2 increases collisions by a very very tiny amount. But replacing -1 by, say, 1546275796, would increase the number of collisions by the same very very tiny amount.
And then I have to invent yet another random constant which I want to avoid if possible. It would just be additional complication without any gain.
If you insist and come up with a concrete proposal for the random value, I'll make the change anyway.
If the hash function is sufficiently random (which we assume it is)
But we don't. In particular, we know for sure that hash(-1) == hash(-2) == -2, and hashes of little integers are common as dirt. While I doubt we'd ever be able to measure the difference, on the face of it it's attractive not to systematically create even more cases that always return -2.
If you insist and come up with a concrete proposal for the random value, I'll make the change anyway.
You already did! 😄 1546275796 is fine (far less likely than -2 to show up as a hash, and fits in 31 bits).
In particular, we know for sure that hash(-1) == hash(-2) == -2, and hashes of little integers are common as dirt. While I doubt we'd ever be able to measure the difference, on the face of it it's attractive not to systematically create even more cases that always return -2.
OK, I see your point: it's not about avoiding collisions between hashes of tuples, it's about avoiding collisions between the hash of a tuple and the hash of a non-tuple (like the integer -2).
OK, done
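So the post-loop check presumably ends up looking something like this (a sketch of the change agreed to above):
if (acc == (Py_uhash_t)-1) {
    return 1546275796;   /* must not return -1 (the error sentinel); -2 would
                            needlessly collide with hash(-1) and hash(-2) */
}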
It looks like we're getting close. Please move pyhash.h code into tupleobject.c.
Done!
Overall, it looks good. Will double check it again later (I'm task saturated at the moment) and then apply. Nice work.
This patch improves the hash function for tuples to avoid structural hash collisions.
The new hash function is a simplified variant of xxHash.
https://bugs.python.org/issue34751