-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-34751: improved hash function for tuples #9471
Merged
Merged
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
4 changes: 4 additions & 0 deletions
4
Misc/NEWS.d/next/Core and Builtins/2018-09-20-15-41-58.bpo-34751.Yiv0pV.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
The hash function for tuples is now based on xxHash | ||
which gives better collision results on (formerly) pathological cases. | ||
Additionally, on 64-bit systems it improves tuple hashes in general. | ||
Patch by Jeroen Demeyer with substantial contributions by Tim Peters. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not comfortable dropping the final permutation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Final mixing" here refers to the xxHash spec's post-loop avalanche code, which is one long serialized critical path spanning 8 instructions including 2 multiplies. I don't believe that's what you (Raymond) have in mind here at all.
If you're talking about merely adding a large constant, ya, that can't hurt (except to add a cycle to the critical path), but xxHash doesn't do it, and I'm at a loss to dream up "a reason" for why it might do any good.
One of my goals here is to make it as brainless as possible to replace this code with the guts of xxHash version 2, if and when someone in real life finds a horrible case that's added to the SMHasher test suite. So I would really like to see a "good reason" to deviate at all from the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the final permutation, a single addend would suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can't identify the purpose of doing this, how could we know whether it suffices? Best I can tell, you believe it helped with nested tuples in some other function, but one of radically different structure. In this function, the result of any nested tuple's hash is immediately multiplied by
_PyHASH_XXPRIME_2
inside the loop, which is a far more disruptive permutation than adding a constant.What it does as-is is powerful enough that we don't have any hint of a problem in either of the intensely "nested tuple" tests we have now. I'd be astonished if adding a constant hurt the test results - but even more astonished if it helped. I really don't want to add a line of code for which the only comment I could write is:
😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reasons I can't fully defend, I think the final addend is important and I'm not comfortable dropping that final step out of the tuplehash. The original xxhash has a complex permutation step, but I think and added will suffice to amplify effects between levels of nesting. It costs one clock, is possibly beneficial, and is utterly harmless. Unless you're dead set against it, let's not battle over this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find it again, but I left a comment somewhere saying that making that change (predictably) made no significant difference in any of the tests. Just seemingly randomly made insignificant differences in the number of collisions in about 20% of the tests. So keeping it in is OK by me, but you come up with a comment to explain it ;-)
Explained before why xxHash does what it does: it's striving for avalanche perfection, and there aren't enough permutations inside the loop for each bit of the last input or two to have a decent chance of affecting all the other bits in the full-width hash code. So they pile up more permutations outside the loop, specifically designed for their avalanche properties.
But they're permutations: two full-width hash codes collide after their avalanche code if and only if they collided before their avalanche code. We do care about collisions, but don't care about avalanche perfection, so their avalanche code serves no purpose we care about.
Merely adding a constant would have done nothing to help xxHash meet its avalanche goals; it's quite clear to me why their post-loop code is as long-winded as it is.