
Issue 147, 154 - Improve citation object hashing behavior #155

Merged: 11 commits into freelawproject:main on Sep 22, 2023

Conversation

@mattdahl (Contributor) commented on Jul 15, 2023

This PR does three things:

  1. Citations with normalized/corrected reporters (e.g., 1 U.S. 1 versus 1 U. S. 1) are now treated as equal (FullCaseCitation.group have badly formatted reporter #147)
  2. Citations with nominative reporters (e.g., 5 U.S. 137 versus 5 U.S. (1 Cranch) 137) are now treated as equal (Citations with and without nominative reporters are not considered equal #154)
  3. Citation hashes are now reproducible/deterministic across runs (FullCaseCitation.group have badly formatted reporter #147 (comment))
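
For illustration, here is a minimal sketch of the intended post-merge behavior. It assumes eyecite's public get_citations() helper, and the example strings simply mirror the issues above:

```python
from eyecite import get_citations

# Normalized vs. oddly spaced reporter (#147)
a = get_citations("1 U.S. 1")[0]
b = get_citations("1 U. S. 1")[0]
assert a == b and hash(a) == hash(b)

# With and without the nominative reporter (#154)
c = get_citations("5 U.S. 137")[0]
d = get_citations("5 U.S. (1 Cranch) 137")[0]
assert c == d and hash(c) == hash(d)
```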

In terms of implementation, here's what's changed:

  • The old comparison_hash() method that a few classes had has been removed entirely. It was originally intended as an essentially private method for Resource objects to call, but the name was confusing and could have been misinterpreted by a casual user as a public method meant for general comparisons. Going forward, everyone should simply use the built-in hash() function for everything, which is the most natural approach anyway.
  • CitationBase classes and their subclasses now all use the @dataclass(eq=False, unsafe_hash=False) decorator. Previously both eq and unsafe_hash were set to True. As the keyword argument implies, this was somewhat... unsafe. Or at least confusing. See here for a detailed discussion explaining how dataclass objects automatically try to create hash functions when these arguments are True. I have changed this so that none of this implicit behavior happens anymore; instead, we explicitly define __eq__() and __hash__() ourselves on the CitationBase class, which creates a clear record of how the hashing actually happens. (These methods are then inherited by the citation subclasses, though the ResourceCitation and CaseCitation classes override the __hash__() method in order to maintain conformity with existing behavior for those special cases.) There is a sketch of this pattern after this list.
  • To create persistent hashes, I use hashlib.sha256. However, the built-in hash() function will only output hashes that are 32 bits long, so I truncate the sha256 hash using c_int32() from the ctypes standard library. By doing this truncation ourselves, we ensure that calling hash(obj) and obj.__hash__() both return the same hash (for a given object).
  • This behavior was not previously well-defined, but I have made it so that IdCitation and UnknownCitation objects are sui generis: each instance hashes uniquely. Since we don't really know anything about what these objects point to, I think it would be dangerous to ever hash two of them to the same value.
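
Here is a self-contained sketch of the hashing pattern described above. The class and helper names are illustrative rather than the exact eyecite source, and (as discussed later in this thread) the c_int32 truncation was ultimately dropped:

```python
import hashlib
import json
from ctypes import c_int32
from dataclasses import dataclass, field


@dataclass(eq=False, unsafe_hash=False)
class CitationBase:
    groups: dict = field(default_factory=dict)

    def hashable_data(self) -> bytes:
        # Serialize deterministically so the digest is stable across runs.
        return json.dumps(self.groups, sort_keys=True).encode("utf-8")

    def __hash__(self) -> int:
        # sha256 gives a run-independent digest; c_int32 truncates it to a
        # signed 32-bit value so hash(obj) and obj.__hash__() agree.
        digest = hashlib.sha256(self.hashable_data()).digest()
        return c_int32(int.from_bytes(digest, "big")).value

    def __eq__(self, other) -> bool:
        return self.__class__ is other.__class__ and hash(self) == hash(other)


@dataclass(eq=False, unsafe_hash=False)
class IdCitation(CitationBase):
    def __hash__(self) -> int:
        # Sui generis: fall back to object identity so no two instances match.
        return id(self)


a = CitationBase(groups={"volume": "1", "reporter": "U.S.", "page": "1"})
b = CitationBase(groups={"volume": "1", "reporter": "U.S.", "page": "1"})
assert a == b and hash(a) == hash(b)
assert IdCitation() != IdCitation()
```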

If merged, this PR supersedes #148. Sorry again @overmode for the previous confusion, this turned out to be quite complicated. I have used some of your logic and tests in this PR, and it should achieve all the functionality you were looking for.

Feedback welcome!

@mattdahl (Contributor, Author) commented:

(It looks like the benchmark action is not working again.)

@mlissner (Member) commented:

@flooie can you please help get this reviewed? @overmode, I think you might want to take a look too, even if you only glance, since I think it's fixing several bugs you reported.

@overmode (Contributor) commented:

Awesome, thanks for implementing the changes and notifying me.

It has been a while since I last looked at eyecite's source code, but the PR looks good to me. You might want to document, for each class, what keys are in the groups, as it would make the comparison more transparent.

My colleague @khakhlyuk will also take a look; he is currently working with eyecite.

@khakhlyuk commented on Jul 18, 2023

Great work, and thank you for implementing this; it is very useful and very much needed!

The code looks good to me.

The only issue I see is using 32 bits for hashes. I am working with a dataset that contains up to 10M unique citations. With a 32-bit hash and a dataset of this size, hash collisions are inevitable. Even for 100k citations, the probability of at least one hash collision is about 69%. I've calculated the probability here: https://kevingal.com/apps/collision.html
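
For reference, the same figures fall out of the standard birthday-bound approximation (my own sketch, not the linked calculator):

```python
import math

def collision_probability(n: int, bits: int) -> float:
    # Birthday bound: P(at least one collision) ~= 1 - exp(-n*(n-1) / (2*d))
    # for n items hashed uniformly into d = 2**bits buckets.
    d = 2 ** bits
    return 1 - math.exp(-n * (n - 1) / (2 * d))

print(collision_probability(100_000, 32))     # ~0.69, the figure quoted above
print(collision_probability(10_000_000, 32))  # ~1.0: collisions are certain
print(collision_probability(10_000_000, 64))  # ~2.7e-06: negligible
```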

Is Python's default hash() output of 32 bits a limitation? I have several thoughts on this.

  1. hash() returns 64-bit ints on 64-bit platforms anyway.
  2. I have tried using larger ints as the output of a hash function and it works. I have experimented with c_int64 and even with using the 256-bit hash directly - int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big"). I have used Citation objects with 64-bit and 256-bit hashes as keys for dicts and sets, and everything works fine.
  3. AFAIK, when creating sets or dicts, Python will apply a modulo to the hash value. That's why it does not make a very big difference whether the hash value is 32-bit, 64-bit, or even 256-bit.

Is there another reason to limit the output to a 32-bit int? If not, it would be really nice to change the hash to 64-bit ints, in my opinion.
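
A small sketch of point 2 above (the class here is illustrative, not eyecite's): an oversized __hash__ return value still works for dicts and sets, because CPython reduces it internally before use.

```python
import hashlib
import json

class WideHashCitation:
    def __init__(self, groups: dict):
        self.groups = groups

    def __hash__(self) -> int:
        # Return the full 256-bit digest as an int; CPython reduces it.
        json_bytes = json.dumps(self.groups, sort_keys=True).encode("utf-8")
        return int.from_bytes(hashlib.sha256(json_bytes).digest(), byteorder="big")

    def __eq__(self, other) -> bool:
        return isinstance(other, WideHashCitation) and self.groups == other.groups

seen = {WideHashCitation({"volume": "5", "reporter": "U.S.", "page": "137"})}
assert WideHashCitation({"volume": "5", "reporter": "U.S.", "page": "137"}) in seen
```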

@mattdahl (Contributor, Author) commented:

> 3. AFAIK, when creating sets or dicts, Python will apply a modulo to the hash value. That's why it does not make a very big difference whether the hash value is 32-bit, 64-bit, or even 256-bit.

Ah, this is interesting! I did not realize it, but for numeric types, you are correct that it just performs a modulo reduction: https://github.com/python/cpython/blob/main/Python/pyhash.c#L34

In that case, I will remove the truncation logic and let hash() do its own thing. The only problem I foresee with this approach is that the hashes will not be the same across 32-bit and 64-bit machines (because of the way the modulo is implemented), but I don't think that's a serious concern.

Thanks for pointing this out.

@mattdahl (Contributor, Author) commented on Jul 18, 2023

All right, 0a30559 removes the truncation, exploiting the fact that calling hash() a second time on an arbitrarily large integer is a no-op (e.g., hash(hash(123)) == hash(123)), which guarantees reproducibility because of that modulo trick. Let me know what you think! (This means that the hashes should be 64-bit on 64-bit machines.)
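
A quick check of the property being relied on here: on 64-bit CPython builds, the hash of an int is the int reduced modulo the Mersenne prime 2**61 - 1 (see pyhash.c), so the result already lies inside the modulus and a second hash() leaves it unchanged.

```python
import hashlib

MODULUS = (1 << 61) - 1  # _PyHASH_MODULUS on 64-bit CPython builds

big = int.from_bytes(hashlib.sha256(b"5 U.S. 137").digest(), "big")
assert hash(big) == big % MODULUS    # large ints are reduced modulo 2**61 - 1
assert hash(hash(big)) == hash(big)  # a second hash() is a fixed point
```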

@mattdahl (Contributor, Author) commented:

@overmode I agree that adding documentation of the expected group keys would be good, I'll get to that soon.

@mlissner (Member) commented:

Thanks all. @flooie final review rests with you, but if the other folks on this thread want to take another look at the latest commits first, I think that'd be great. Thank you all!

@khakhlyuk commented:

I like the trick of applying an additional hash(). Good solution!
LGTM

@mattdahl (Contributor, Author) commented:

I added some documentation spelling out the expected content of self.groups for the different citation classes, as suggested by @overmode. As I was doing this, however, I realized that the volume key is not guaranteed to exist for CaseCitation objects. (The reporter and page keys are, or at least they should be, because of a test we have in reporters_db. This is also reflected in the new documentation.) Thus, my previous code would crash for, e.g., certain Tax Court citations. I've fixed that bug now too.
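
For example, a hedged sketch of the guard described above (the groups dict here is hypothetical):

```python
# "reporter" and "page" are guaranteed by reporters_db tests; "volume" is not,
# so it has to be read defensively when building the hashable data.
groups = {"reporter": "T.C. Memo.", "page": "2019-233"}  # hypothetical, no volume
key = (groups.get("volume"), groups["reporter"], groups["page"])  # volume may be None
```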

@mattdahl (Contributor, Author) commented:

Comrade @flooie, just wanted to bump this if you have a chance to review and merge.

@flooie merged commit 65832f9 into freelawproject:main on Sep 22, 2023
@mattdahl mentioned this pull request on Feb 3, 2024