Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash function is invariant to reversal of arrays of symbols #20744

Closed
NickMcNutt opened this issue Feb 22, 2017 · 14 comments
Closed

Hash function is invariant to reversal of arrays of symbols #20744

NickMcNutt opened this issue Feb 22, 2017 · 14 comments

Comments

@NickMcNutt
Copy link

Julia Version 0.6.0-dev.2893
Commit 33f464ed25* (2017-02-21 17:48 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.5.0)
  CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

From the manual for hash

Compute an integer hash code such that isequal(x,y) implies hash(x)==hash(y).

While the converse of this statement isn't strictly implied, I think it would be a reasonable thing to expect from a hash function (barring adversarial hash collisions), and this isn't the case for vectors of symbols:

isequal([1, 2, 3], [3, 2, 1]) # false
hash([1, 2, 3]) == hash([3, 2, 1]) # false

isequal([:a, :b, :c], [:c, :b, :a]) # false
hash([:a, :b, :c]) == hash([:c, :b, :a]) # true

Would it be possible to modify the hash function so that it is no longer invariant to commonly used object transformations?

@stevengj
Copy link
Member

stevengj commented Feb 22, 2017

The root problem here is that

hash(:c, hash(:b, hash(:a, zero(UInt)))) == hash(:a, hash(:b, hash(:c, zero(UInt))))

returns true, as a consequence of hash(x::Symbol, h::UInt) = 3*object_id(x) - h.

@JeffBezanson
Copy link
Member

I think we should bring back the bitmix function deprecated in 0.3 (!). bitmix(x,y) was deprecated to hash(x,y), but I'm not really happy with that since hash has to pay the cost of hashing x numerically, when all we need is mixing. This makes it easier to mix multiple hash values more thoroughly.

@JeffreySarnoff
Copy link
Contributor

@JeffBezanson Where may I see that bitmix(x,y) implementation?

@JeffBezanson
Copy link
Member

It called C functions in src/support/hashing.[ch].

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 23, 2017

Using bitmix for mixing hashes was causing us to do about twice as much hashing as necessary, which is why I got rid of it – we were hashing each value and then bitmixing the hashes. With a Merkle-Damgard construction, you only need to do one hash operation per object, not two. The trouble here is that we're doing zero actual hash operations per object and instead relying on object_id to be "random". The correct solution here is that symbol hashing (and general object hashing) should not be so simple and actually use a real hash function – i.e. not a function that can easily be predicted and inverted.

@stevengj
Copy link
Member

@StefanKarpinski, this issue has nothing to do with whether object_id is "random" or a good hash. The problem is with the mixing — you get exactly the same collision if you use the same mixing function with rand(UInt) "hashes":

julia> badmix(x::UInt, h::UInt) = 3*x - h
badmix (generic function with 1 method)

julia> x1 = rand(UInt); x2 = rand(UInt); x3 = rand(UInt);

julia> badmix(x1, badmix(x2, badmix(x3, zero(UInt)))) == badmix(x3, badmix(x2, badmix(x1, zero(UInt))))
true

If we just defined hash(x::Symbol, h::UInt) = bitmix(object_id(x), h) with a better bitmix function, it seems like the current issue would be fixed..

@stevengj
Copy link
Member

I can confirm that using the old bitmix function fixes the problem:

julia> int64hash(x) = ccall(:int64hash, UInt, (UInt,), x)
int64hash (generic function with 1 method)

julia> bitmix(a, b) = int64hash(xor(a, bswap(b)))
bitmix (generic function with 1 method)

julia> myhash(x::Symbol, h::UInt) = bitmix(object_id(x), h)
myhash (generic function with 1 method)

julia> myhash(:c, myhash(:b, myhash(:a, zero(UInt)))) == myhash(:a, myhash(:b, myhash(:c, zero(UInt))))
false

@stevengj
Copy link
Member

stevengj commented Feb 23, 2017

An even simpler solution, that doesn't require the oldbitmix:

julia> myhash(x::Symbol, h::UInt) = hash(object_id(x), h)
myhash (generic function with 1 method)

julia> myhash(:c, myhash(:b, myhash(:a, zero(UInt)))) == myhash(:a, myhash(:b, myhash(:c, zero(UInt))))
false

Maybe this is what Stefan mean by "using an actual hash function?" I was distinguishing between "hashing x" and "mixing with h", but Stefan was not?

@StefanKarpinski
Copy link
Member

I already have a PR fixing this. It's essentially the same as your fix except that it uses low-level bit hashing (hash_uint) directly instead of going through the hash function, which is the "cost of hashing x numerically" that @JeffBezanson was unhappy about paying. Although the cost is pretty low for UInt values, it's not nothing (there's an uitofp call required and some other ops).

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 23, 2017

I was curious about the history of this: it was originally correct but was broken here in order to not pay the hashing cost for objects since object_id is already a hash. However, when chaining hashes of objects, this means no real hashing is ever done at all. A somewhat cheaper solution here would be to change the hashing function to 3h - object_id(x) since then hash(a, hash(b, hash(c, h))) wouldn't produce 3object_id(a) - 3object_id(b) + 3object_id(c) - h but would instead produce 27h - 9object_id(a) - 3object_id(b) - object_id(c) which at least treats a and c asymmetrically. But that still strikes me as fairly sketchy since we're not doing any real hashing – it's all simple linear operations. The hash_uint function I'm using in my PR is exactly translated from the C int64hash function that was previously used to define bitmix, so it's hard to see how my PR could be objectionable from a performance standpoint.

@JeffreySarnoff
Copy link
Contributor

y'all seem to be satisfied with MurmurHash3 .. why?

@StefanKarpinski
Copy link
Member

Inertia? That's only used for hashing large data blobs, this is a different hash function entirely.

@StefanKarpinski
Copy link
Member

Another way to avoid any hashing beyond the computation of object_id in the single-value case would be to define single-argument hash(x::ANY) = object_id(x) and then define hash(x::ANY, h::UInt) = object_id(x) $ hash_uint(h) or something like that. This saves a single hash operation in any M-D style chain of hash operations, in particular doing no hashes in the single-object case. I'm not sure if object_id is good enough as a hash.

@JeffBezanson
Copy link
Member

Yes, I agree your PR using hash_uint is acceptable. object_id also internally uses the same hashing and mixing functions (e.g. int64hash), so it's probably good enough on its own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants