-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative hash function #5146
Conversation
it is multiplicative function resilent to HashDos given following restriction is satisfied: - neither seed nor hash-value is exposed to atacker.
db221fb
to
49dd92e
Compare
src/crystal/hasher.cr
Outdated
end | ||
|
||
@[NoInline] | ||
def string(value : Bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be named def slice(value : Bytes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is for the case, when slice of bytes is a string.
Ok, maybe it could be def bytes(value : Bytes)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, def bytes(value : Bytes)
or just def bytes(value)
since other methods don't specify a type either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ysbaddaden , should I make Slice(UInt8)#hash
to use hasher.bytes(self)
as you did in 7733d3e ?
Can we include a reference or link to somewhere that explains the algorithm? |
There is derived from my https://github.com/funny-falcon/funny_hash/blob/master/README.md There is no formal description, though. It is two multiplicative hash combined that uses different multiplicative constants, and ensures that every block bit affects at least 20 state bits. Guarantee cames from fact both 32bit halves of 64bit block affects are mixed into low 32bits of state halves before multiplication. Looking at https://en.m.wikipedia.org/wiki/Universal_hashing I see some similarity with hashing vectors. Lets be clear. I will not spend my time in debates again. I have no formal verification, I never wrote and published any paper on this. |
Looks like https://github.com/funny-falcon/funny_hash -
Because it like Murmur3 I provide some implementations that use MurMur3 (but Murmur3 has some known defects while Funny hash - not yet): It has been adopted into a number of open-source projects, most notably libstdc++ (ver 4.6), nginx (ver 1.0.1), Rubinius, libmemcached (the C driver for Memcached), npm (nodejs package manager), maatkit, Hadoop, Kyoto Cabinet, RaptorDB, OlegDB, Cassandra, Solr, vowpal wabbit, Elasticsearch, and Guava. |
I vote for this hashing function just because it makes us faster than Go, Rust etc. :) |
Not faster than Go, actually. Their aes-ni function is super fast on long strings, and probably doesn't suck on short. https://github.com/golang/go/blob/master/src/runtime/hash64.go But their function is longer to code, uses larger state for long strings, and I don't think non-aes variant is safer than my function. Note, they have specialization for hashing single Int32 and single Int64 (memhash32, memhash64). They are more free in choosing functions, and their specializations, cause Go doesn't give any ability to provide custom hash functions. All hash calculation is controlled by runtime, and could not be overriden. |
I hashed a dictionary of 216 555 English words:
EDIT I got the list of english words from this outstanding answer: https://softwareengineering.stackexchange.com/a/145633 EDIT 2: that's better :) |
@ysbaddaden , thank you. I've pushed fix. |
@ysbaddaden , and I've pushed your commit with specialization for |
Down to zero collisions, that's better! I'm trying to check distribution, now :) |
LGTM and I want it in It is prerequisite for fast hashes. |
I miss specs for the funny hash algorithm against static vectors, just to make sure we don't reintroduce a bug later (such as the collisions above). |
Also, a link to https://github.com/funny-falcon/funny_hash inside the implementation (not in public docs) |
@ysbaddaden By the way, really cool graphs! How did you generate them? (and what code did you use?) |
b8a4bc3
to
78fc3e9
Compare
src/crystal/hasher.cr
Outdated
# and then combines them with addition. It greatly reduce | ||
# possibility of state deduction. | ||
# | ||
# Note, it provides good protection from HashDos iif: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iif -> if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. iif -> iff
https://en.m.wikipedia.org/wiki/If_and_only_if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or write it without abbreviations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
For fun, I checked distribution a little more, with 200 000 random UUIDs and 42 614 numbers (american zip codes). I got 8 collisions for UUIDs on Crystal 0.23.1, no collisions otherwise. SipHash and FunnyHash feel similar in their distribution. UUIDs (overall distribution is good because
Numbers (the result is so biased on master hashes end up at two x,y spots):
Conclusion: switching to funny hash or siphash will really help avoid concentration in real world scenarios, when keys are poorly distributed, not just mitigating hash flooding attacks. @asterite I used SDL, see https://gist.github.com/ysbaddaden/fccf87d4ca22082c7d645b9f547a9713 |
Changed tests. |
spec/std/crystal/hasher_spec.cr
Outdated
[hasher, hasher1, hasher2, hasher12] | ||
.map(&.result) | ||
.combinations(2) | ||
.each do |(a, b)| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong indentation here, and in other cases below. Also, I think it'd look better as:
[hasher, hasher1, hasher2, hasher12]
.map(&.result)
.combinations(2)
.each { |(a, b)| a.should_not eq(b) }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You you sure, @Sija?
➜ crystal git:(althasher) ✗ bin/crystal tool format src spec
Using compiled compiler at `.build/crystal'
➜ crystal git:(althasher) ✗ gd
➜ crystal git:(althasher) ✗
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akzhan yes I am, block body SHOULD be indented, yet it's not. Perhaps it's a formatter bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But idea with {}
blocks on (should) looks better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crystal tool format
made this indentation.
Using curly braces here is semantically wrong: assertion is an action, not computation or transformation.
But if you insist, I will change to curly braces. Should I?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@funny-falcon lets stay things as is. they automatically fixed with fixes on the formatter.
/cc @makenowjust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
block body SHOULD be indented, yet it's not. Perhaps it's a formatter bug?
It is indented relative to statement level. It is not a bug, just current rules. May be rule should be changed, but which way?
May be, method calls should be indented by 4 spaces?
[hasher, hasher1, hasher2, hasher12]
.map(&.result)
.combinations(2)
.each do |(a, b)|
a.should_not eq(b)
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luislavena IMO there are 2 separate issues here: formatter bug and bad readability of do ... end
version of block bodies - first one should be handled in separate PR, yet the second one results in badly readable code and could be fixed, so why not do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is matter of multi-line vs one-line block bodies, not the "actionable" intent,
There are different points of view on this question. I prefer curly braces only for "one-line non-actionable".
But, ok, I've changed to curly braces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PLEASE, let's not discuss such small syntax details. If the specs pass it's OK. This only makes the contributor feel tired and annoyed about stuff that doesn't really matter.
There's no formatter bug, the expression starts at indent 0 and the block is at indent 2.
I'm not sure why we've suddenly abstracted the implementation from |
@RX14 You are right, just now I see that change. There should just be |
@RX14, @asterite perhaps the wording of this comment was the issue? |
It's fine to leave just |
@luislavena Probably. I think we have an issue where someone sends a PR and gets many comments saying "Please do this, please do that" and the sender just says "OK, OK" and does it, without first having a good discussion about whether that should be done or not. Who decides it? It's pretty hard. I see this happen in many PRs already. I don't know what the solution is. Maybe not rush doing changes before we finish agreeing on them. |
Next step is near - FAST hashes :) |
Lets border line:
I rush cause I'm not always understand clearly what is asked to do, but I agree with some part of. Another reason, why I rush, is cause discussion could last forever. I also like to discuss a lot. |
My opinion:
Once that's done I think we can merge this. If others agree please 👍 on this comment. Once we have enough 👍 here I guess you can do that and then we can merge 😄 🎉 |
Just to prevent occasional disclosure of state, hide it in `def inspect`. It will not prevent intentional disclosure, though, because state can be easily inspected, and `to_s` could be redefined.
3711529
to
c709970
Compare
Ok, returned to monolith. I've added another one small change, excuse me for that:
|
spec/std/crystal/hasher_spec.cr
Outdated
B | ||
end | ||
|
||
alias THasher = Crystal::Hasher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestHasher
plz :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gr8, thx!
|
||
def nil | ||
@a += @b | ||
@b += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier you use LFSR here, it is not wanted anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both state variable should change in some way.
Combination of addition on @A and lsfr on @b gives period almost 2**128
. That is great, but unnecessary.
Current version is "quadrating probing", and gives period 2**64
. I think, it is just enough.
Even 2**32
could be enough for intended use case, because if someone triggers 2**32
calls to nil.hash(hasher)
,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if some triggers 2**32
calls to nil.hash(hasher)
, they are already successful at DoS-ing.
🎉 🎉 🎉 |
Reimplementing this hash function in Rust, I wondered what would be the difference between this implementation and the original one in C. Turns out (at least for strings) it is not more than a one-digit difference in the Concerning hash functions, I am not knowledgeable in a way, but I wonder if this difference was a) a mistake b) on purpose or c) it does not really matter. Just curious, maybe @funny-falcon can shed some light on this? This does not really matter much, only if I should call this "funny_hash" or "funny_crystal_hash" ;) |
This is multiplicative function resilent to HashDos attack.
(Claim is bounded with restrictions that neither seed nor hash-value is exposed to attacker).
It provides good performance, and at least certainly not slower than current unsafe hasher's hash function.
This PR is alternative to #5104
(though I don't claim this function is alternative to SipHash in all possible SipHash's applications).