RFD 0003: OOI Hashing #4004

Open: jpbruinsslot wants to merge 13 commits into `main` from `rfd/0003`
Commits (13):

- `b803415` RFD 0003: Adding placeholder for RFD (jpbruinsslot)
- `d3acd14` Add initial rfd, adapted from issue #4000 (jpbruinsslot)
- `00ba87b` Fix author (jpbruinsslot)
- `ce157a1` Add additional label (jpbruinsslot)
- `044fd37` Update rfd/0003-ooi-hashing.md (originalsouth)
- `a750832` Update rfd/0003-ooi-hashing.md (originalsouth)
- `28e6d06` Update rfd/0003-ooi-hashing.md (originalsouth)
- `fa60f19` Update rfd/0003-ooi-hashing.md (originalsouth)
- `6978c5e` Make precommit happy (originalsouth)
- `beb8cd2` Merge branch 'main' into rfd/0003 (originalsouth)
- `1e12858` Fix references and code block (jpbruinsslot)
- `1b3d1fa` Update rfd/0003-ooi-hashing.md (originalsouth)
- `d92fd85` Make precommit happy (originalsouth)
---
authors: (@originalsouth)
state: draft
discussion: https://github.com/minvws/nl-kat-coordination/pull/4004
labels: hashing, pydantic, ooi, octopoes
---

# RFD 0003: OOI Hashing

## Background

The current `__hash__` implementation of OOIs:

https://github.com/minvws/nl-kat-coordination/blob/8730e188e9dad6276a363aaeaead8fc1deb82ac9/octopoes/octopoes/models/__init__.py#L242-L243
is broken because it only considers the primary key, meaning that OOIs whose
differing fields are _not_ recorded in the primary key are erroneously deemed
to be the same object, causing collisions in Python's built-in hash-based
structures. Additionally, we will have to decide whether we want:

```python
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
```

to have the same or different hashes (i.e. `hash(d1) == hash(d2)` or
`hash(d1) != hash(d2)`), since Python dicts preserve insertion order.
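To make the failure mode concrete, here is a minimal sketch of a primary-key-only `__hash__`. The class and field names (`FakeOOI`, `ttl`) are hypothetical stand-ins, not the actual Octopoes model:

```python
# Hypothetical stand-in for an OOI: __hash__ considers only the primary
# key, mirroring the broken implementation linked above.
class FakeOOI:
    def __init__(self, primary_key: str, ttl: int):
        self.primary_key = primary_key
        self.ttl = ttl  # not part of the primary key

    def __hash__(self) -> int:
        return hash(self.primary_key)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, FakeOOI) and self.primary_key == other.primary_key


a = FakeOOI("DNSRecord|example.com", ttl=300)
b = FakeOOI("DNSRecord|example.com", ttl=86400)

# The TTLs differ, yet hash-based containers treat the objects as one:
print(hash(a) == hash(b))  # True
print(len({a, b}))         # 1 -- the second object is silently dropped
```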
## Proposal

Since OOIs are based on Pydantic `BaseModel`s, we can easily generate a dict
of the object using `model_dump`. Assuming that this dict is the best object
to base our `__hash__` implementation on, the question becomes how to best
hash a dict (as Python still does not offer a native way to do this).

A natural question arises: why not hash `model_dump_json`? Because there is
no guarantee that its output is stable; see
https://github.com/pydantic/pydantic/discussions/10343.
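As a sketch of the direction, the snippet below hashes a dict with a variant of the `freeze()` approach evaluated later in this RFD, extended with a hypothetical `sort_keys` option. It also answers the `d1`/`d2` question from the Background: as written, the reduction follows dict insertion order, so order-insensitivity has to be opted into explicitly:

```python
from collections.abc import Iterable
from typing import Any


def freeze(obj: Any, sort_keys: bool = False) -> Iterable[int]:
    """Recursively reduce a nested structure to a stream of hashes."""
    if isinstance(obj, dict):
        items = sorted(obj.items()) if sort_keys else obj.items()
        for key, value in items:
            yield hash(key)
            yield from freeze(value, sort_keys)
    elif isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        for item in obj:
            yield from freeze(item, sort_keys)
    else:
        yield hash(obj)


def hash_dict(d: dict, sort_keys: bool = False) -> int:
    return hash(tuple(freeze(d, sort_keys)))


d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

# Insertion-order sensitive: d1 and d2 hash differently
# (with overwhelming probability)...
print(hash_dict(d1) == hash_dict(d2))
# ...unless the keys are sorted first:
print(hash_dict(d1, sort_keys=True) == hash_dict(d2, sort_keys=True))  # True
```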
## Evaluation

Hence, here I compare two algorithms with benchmarks:

1. hashing the OOI using a custom `freeze()` function (`hasher_1` below)
2. hashing the OOI using the [`jcs`](https://pypi.org/project/jcs/) package (`hasher_2` below)

Code:
```python
#!/usr/bin/env python

from typing import Iterable, Any
import jcs
import random
import string
import time

N = 8
MAX_DEPTH = 4
K = 10**6


def random_string():
    return ''.join(random.choice(string.ascii_letters) for _ in range(random.randint(1, N)))


def random_obj(depth: int = 0, k: int = 7) -> Any:
    if depth < MAX_DEPTH:
        rand_choice = random.randint(0, k)
        if rand_choice == 0:
            return random.random()  # float
        elif rand_choice == 1:
            return random.randint(0, 10**6)  # int
        elif rand_choice == 2:
            return random_string()  # string
        elif rand_choice == 3:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
        elif rand_choice == 4:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2) for _ in range(random.randint(1, N))}  # set
        else:
            return {random_string(): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[str, Any]
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[Any, Any]
    else:
        return random_string()


targets = [random_obj() for _ in range(K)]


def hasher_1(obj: Any) -> int:
    def freeze(obj: Iterable[Any | Iterable[Any]]) -> Iterable[int]:
        if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
            if isinstance(obj, dict):
                for key, value in obj.items():
                    yield hash(key)
                    yield from freeze(value)
            else:
                for item in obj:
                    yield from freeze(item)
        else:
            yield hash(obj)

    return hash(tuple(freeze(obj)))


def hasher_2(obj: Any) -> int:
    return hash(jcs.canonicalize(obj))


st = time.time_ns()
_ = list(map(hasher_1, targets))
dt = time.time_ns() - st

print(f"hasher_1: {dt / 10**9 / K}s")


st = time.time_ns()
_ = list(map(hasher_2, targets))
dt = time.time_ns() - st

print(f"hasher_2: {dt / 10**9 / K}s")
```

Resulting in:

```
hasher_1: 2.213041571e-05s
hasher_2: 3.159127834e-05s
```
## Determinations

Personally, I would opt for `hasher_1` as it is more flexible and faster, but
`hasher_2` is easier to maintain; I am also open to other suggestions.
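To illustrate the flexibility claim, here is a small sketch reusing the `freeze()`-based `hasher_1` from the benchmark: it digests values that are not valid JSON, such as sets and non-string dict keys, which `jcs.canonicalize` rejects (the input values below are arbitrary examples):

```python
from collections.abc import Iterable
from typing import Any


def hasher_1(obj: Any) -> int:
    # Same freeze()-based hasher as in the benchmark above.
    def freeze(obj: Any) -> Iterable[int]:
        if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
            if isinstance(obj, dict):
                for key, value in obj.items():
                    yield hash(key)
                    yield from freeze(value)
            else:
                for item in obj:
                    yield from freeze(item)
        else:
            yield hash(obj)

    return hash(tuple(freeze(obj)))


# A set and an int-keyed dict: not JSON-serializable, so hasher_2 would
# raise, while hasher_1 digests them. Caveat: set iteration order is not
# deterministic across interpreter runs, so set hashes are only stable
# within a single run.
h = hasher_1({"tags": {"a", "b"}, 42: [1, 2, 3]})
print(isinstance(h, int))  # True
```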
So how do we proceed to solve this problem? | ||

## References

- Issue #3808: has to be solved either in that branch or before that branch is merged.
- Issue #4000: original issue and discussion
---

**Review comment:**

I'd like a little more explanation of why we chose to override the default hashing and implement hashing only on the primary-key fields. Because we cannot have two different objects in the graph with the same primary key, it makes sense to have the rest of the codebase 'think' they are the same by hashing only those fields that would lead to different objects in the graph.

I'd also like a little more info on why and where this is problematic for the Python code itself, if we disregard the notion that we can only keep one version of two objects in the database when their primary key is the same.

Furthermore, a bit more detail on why this specifically is a problem in the context of dealing with objects in nibbles, where we want to 'see' changes on any field that is used/consumed/read/queried by the nibble, regardless of the primary key's values. For example, having a TTL for a DNSRecord in the primary key is useless for our graph, but you could still envision a nibble that alerts users of DNSRecords with TTLs that are either too long or too short. Not being able to see TTL changes (because from the outside the hash of the OOI stayed the same) would mean we only get to run the nibble once, which is obviously not what you'd want.

n.b. A test nibble doing exactly this (min/max on TTL based on a config) would be a great way of demonstrating this and adding functionality.