-
-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 147, 154 - Improve citation object hashing behavior #155
Conversation
(It looks like the benchmark action is not working again.) |
Awesome, thanks for implementing the changes and notifying me. It has been a while since last time I took a look at eyecite's source code but the PR looks good to me. You might want to document, for each class, what keys are in the groups, as it would make the comparison more transparent. My colleague @khakhlyuk will also take a look, he is currently working with eyecite. |
great work, thank you for implementing this, very useful and very much needed! The code looks good to me. The only issue I see is using 32bits for hashes. I am working with a dataset that contains up to 10M unique citations. With a 32bit hash and a dataset of this size hash collisions are inevitable. Even for 100k citations, the probability of a single hash collision is 69%. I've calculated the probability here: https://kevingal.com/apps/collision.html Is python's default
Is there another reason to limit output to a 32-bit int? If not, it would be really nice to change the hash to 64bit ints in my opinion. |
Ah, this is interesting! I did not realize, but for numeric types, you are correct that it just performs a modulo reduction: https://github.com/python/cpython/blob/main/Python/pyhash.c#L34 In that case, I will remove the truncation logic and let Thanks for pointing this out. |
All right, 0a30559 removes the truncation, exploiting the fact that calling |
@overmode I agree that adding documentation of the expected group keys would be good, I'll get to that soon. |
Thanks all. @flooie final review rests with you, but if the other folks on this thread want to take another look at the latest commits first, I think that'd be great. Thank you all! |
i like the trick with applying an additional hash(). Good solution! |
I added some documentation spelling out the expected content of |
Comrade @flooie, just wanted to bump this if you have a chance to review and merge. |
This PR does three things:
1 U.S. 1
versus1 U. S. 1
) are now treated as equal (FullCaseCitation.group have badly formatted reporter #147)5 U.S. 137
versus5 U.S. (1 Cranch) 137
are now treated as equal (Citations with and without nominative reporters are not considered equal #154)In terms of implementation, here's what's changed:
comparison_hash()
method that a few classes had has been removed entirely. This was originally intended to essentially be a private method forResource
objects to call, but the name was confusing and could have been misinterpreted by a causal user as a public method that should be used for general comparisons. Going forward, anyone should simply use the built-inhash()
method for everything, which is the most natural approach anyway.CitationBase
classes and their subclasses now all use the@dataclass(eq=False, unsafe_hash=False)
decorator. Previously botheq
andunsafe_hash
were set toTrue
. As the keyword argument implies, this was somewhat... unsafe. Or at least confusing. See here for a detailed discussion explaining how dataclass objects automatically try to create hash functions when these arguments areTrue
. I have changed this so none of this implicit stuff is happening anymore; instead, we explicitly define__eq__()
and__hash__()
ourselves on theCitationBase
class, which creates a clear record of how the hashing is actually happening. (These methods are then inherited by the citation subclasses, thoughResourceCitation
andCaseCitation
classes override the__hash__()
method in order to maintain conformity with existing behavior for those special cases.hashlib.sha256
. However, the built-inhash()
function will only output hashes that are 32 bits long, so I truncate the sha256 hash usingc_int32()
from thectypes
standard library. By doing this truncation ourselves, we ensure that callinghash(obj)
andobj.__hash__()
both return the same hash (for a given object).IdCitation
andUnknownCitation
objects are sui generis. Since we don't really know anything about what these objects point to, I think it would be dangerous to ever hash two of them the same.If merged, this PR supersedes #148. Sorry again @overmode for the previous confusion, this turned out to be quite complicated. I have used some of your logic and tests in this PR, and it should achieve all the functionality you were looking for.
Feedback welcome!