---
authors: (@originalsouth)
state: draft
discussion: https://github.com/minvws/nl-kat-coordination/pull/4004
labels: hashing, pydantic, ooi, octopoes
---

# RFD 0003: OOI Hashing

## Background

> **Reviewer comment (Contributor):** I'd like a little more explanation of why we chose to override the default hashing and hash only the primary-key fields: because we cannot have two different objects in the graph with the same primary key, it makes sense to have the rest of the codebase also 'think' they are the same by hashing only those fields that would lead to different objects in the graph.
>
> I'd also like a little more information on why and where this is problematic for the Python code itself, if we disregard the notion that we can only keep one version of two objects in the database when their primary key is the same.
>
> Furthermore, a bit more detail on why this is specifically a problem in the context of dealing with objects in nibbles, where we want to 'see' changes on any field that is used/consumed/read/queried by the nibble, regardless of the primary key's values. For example, having the TTL of a DNSRecord in the primary key is useless for our graph, but you could still envision a nibble that alerts users of DNSRecords with TTLs that are either too long or too short. Not being able to see TTL changes (because from the outside the hash of the OOI stayed the same) would mean we only get to run the nibble once, which is obviously not what you'd want.
>
> n.b. A test nibble doing exactly this (min/max on TTL based on a config) would be a great way of demonstrating this and adding functionality.

The current `__hash__` implementation of OOIs:

https://github.com/minvws/nl-kat-coordination/blob/8730e188e9dad6276a363aaeaead8fc1deb82ac9/octopoes/octopoes/models/__init__.py#L242-L243

is broken because it only considers the primary key: OOIs that differ only in
fields _not_ recorded in the primary key are erroneously deemed to be the same
objects, causing Python's built-in hash-dependent structures (dicts and sets) to produce collisions.
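
To make this concrete, here is a minimal sketch (using a hypothetical `DNSRecordSketch` model, not the actual OOI definitions) of the collision described above:

```python
from pydantic import BaseModel


class DNSRecordSketch(BaseModel):
    hostname: str
    value: str
    ttl: int  # assumed not to be part of the primary key in this sketch

    @property
    def primary_key(self) -> str:
        return f"DNSRecord|{self.hostname}|{self.value}"

    def __hash__(self) -> int:
        # Mirrors the primary-key-only hashing linked above.
        return hash(self.primary_key)


a = DNSRecordSketch(hostname="example.com", value="1.2.3.4", ttl=300)
b = DNSRecordSketch(hostname="example.com", value="1.2.3.4", ttl=86400)
assert a != b              # the objects clearly differ (ttl)
assert hash(a) == hash(b)  # yet hash-based structures see them as colliding
```
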
Additionally, we will have to consider whether we want the following two dicts:

```python
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
```

to hash the same or differently (i.e. whether `hash(d1) == hash(d2)` or `hash(d1) != hash(d2)`), given that Python dicts preserve insertion order; a short illustration follows below.
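
As a small illustration of this ordering question (assuming the `jcs` package that is also used in the evaluation below), JCS canonicalization normalizes key order, whereas an insertion-order-sensitive scheme distinguishes the two dicts:

```python
import jcs

d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

# JCS (RFC 8785) sorts object keys, so both dicts canonicalize to the same bytes.
assert jcs.canonicalize(d1) == jcs.canonicalize(d2)

# An insertion-order-sensitive view of the same dicts differs.
assert tuple(d1.items()) != tuple(d2.items())
```
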
These issues are particularly problematic for nibbles:

- Nibbles xxhash OOIs to decide on recalculation; the underlying serialization (necessary for the resolution of this problem) must therefore be reliable for hashing.
- Nibbles are written with the uniqueness of objects in mind and use dicts and sets accordingly; they will likely suffer a performance impact, both because these data structures get slower and because nibbles will run or rerun unexpectedly.

## Proposal

Since OOIs are based on Pydantic `BaseModel`s, we can easily generate a dict of
the object using `model_dump`. Assuming this is the best representation to base
our `__hash__` implementation on, the question becomes how to best hash a dict
(as Python still does not offer a way to do this natively).

The natural question arises: why not hash `model_dump_json`? Because there is no
guarantee that its output is stable; see
https://github.com/pydantic/pydantic/discussions/10343.
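
To illustrate why a dedicated dict-hashing strategy is needed at all (a minimal sketch with a hypothetical model, not an actual OOI class): `model_dump` returns a plain, possibly nested dict, which Python's built-in `hash()` refuses to hash.

```python
from pydantic import BaseModel


class ExampleModel(BaseModel):
    name: str
    tags: list[str] = []


obj = ExampleModel(name="a", tags=["x", "y"])
dump = obj.model_dump()  # {'name': 'a', 'tags': ['x', 'y']}

try:
    hash(dump)
except TypeError as e:
    print(e)  # unhashable type: 'dict'
```
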

## Evaluation

Hence, here I compare two algorithms with benchmarks:

1. hashing the OOI using a custom `freeze()` function (`hasher_1` below)
2. hashing the OOI using the [`jcs`](https://pypi.org/project/jcs/) package (`hasher_2` below)

Code:

```python
#!/usr/bin/env python

from typing import Iterable, Any
import jcs
import random
import string
import time

N = 8
MAX_DEPTH = 4
K = 10**6


def random_string():
    return ''.join(random.choice(string.ascii_letters) for _ in range(random.randint(1, N)))


def random_obj(depth: int = 0, k: int = 7) -> Any:
    if depth < MAX_DEPTH:
        rand_choice = random.randint(0, k)
        if rand_choice == 0:
            return random.random()  # float
        elif rand_choice == 1:
            return random.randint(0, 10**6)  # int
        elif rand_choice == 2:
            return random_string()  # string
        elif rand_choice == 3:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
        elif rand_choice == 4:
            return [random_obj(depth + 1, 2) for _ in range(random.randint(1, N))]  # list
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2) for _ in range(random.randint(1, N))}  # set
        else:
            return {random_string(): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[str, Any]
            # Non JSON compatible so broken for hasher_2 but hasher_1 digests it
            # return {random_obj(depth + 1, 2): random_obj(depth + 1) for _ in range(random.randint(1, N))}  # dict[Any, Any]
    else:
        return random_string()


targets = [random_obj() for _ in range(K)]


def hasher_1(obj: Any) -> int:
    def freeze(obj: Iterable[Any | Iterable[Any]]) -> Iterable[int]:
        if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
            if isinstance(obj, dict):
                for key, value in obj.items():
                    yield hash(key)
                    yield from freeze(value)
            else:
                for item in obj:
                    yield from freeze(item)
        else:
            yield hash(obj)

    return hash(tuple(freeze(obj)))


def hasher_2(obj: Any) -> int:
    return hash(jcs.canonicalize(obj))


st = time.time_ns()
_ = list(map(hasher_1, targets))
dt = time.time_ns() - st

print(f"hasher_1: {dt / 10**9 / K}s")


st = time.time_ns()
_ = list(map(hasher_2, targets))
dt = time.time_ns() - st

print(f"hasher_2: {dt / 10**9 / K}s")
```

Resulting in (average time per hashed object):

```
hasher_1: 2.213041571e-05s
hasher_2: 3.159127834e-05s
```

## Determinations

Personally, I would opt for `hasher_1`, as it is more flexible and faster, but
`hasher_2` is easier to maintain; I am also open to other suggestions.
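
Should `hasher_1` be chosen, the wiring could look roughly like the sketch below (assuming a Pydantic v2 OOI base class and the `freeze()`-based `hasher_1` from the benchmark above being in scope; this is not the final implementation):

```python
from pydantic import BaseModel


class OOISketch(BaseModel):
    """Hypothetical stand-in for the OOI base model."""

    def __hash__(self) -> int:
        # Hash the full model_dump() instead of only the primary key,
        # using hasher_1 as defined in the benchmark above.
        return hasher_1(self.model_dump())
```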

So how do we proceed to solve this problem?

## References

- [Issue #3808](https://github.com/minvws/nl-kat-coordination/issues/3808): has to be solved either in that branch or before that branch is merged.
- [Issue #4000](https://github.com/minvws/nl-kat-coordination/issues/4000): original issue and discussion