You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
There are currently two kinds of tests for ensuring that hashes have a collision probability that's correlated with similarity:
Create three inputs x, y, and z where x and y have high similarity while x and z have low similarity. Check that x and y have a higher hash collision rate than x and z for a large number of randomly sampled hash functions from the relevant LSH family.
Generate two inputs x and y and compute their similarity. Calculate the theoretical probability that a randomly selected hash function from the relevant LSH family experiences a collision between x and y. Then sample a large number of hash functions and ensure that the collision rate is within some margin of the expected collision rate (generally within ±0.05).
The first type of test works fine, especially when writing code for a new hash function or when the similarity / theoretical collision probability is difficult to compute. In general the second type of test is preferable. However, it's a little inaccurate because we should be using a confidence interval instead of a fixed interval like p±0.05. With a confidence interval we'd have a better idea of the expected range of values that can be taken on by the hash collision frequencies, justifying the number of trials we perform to ensure that theoretical probability roughly matches observed frequency.
Solution
The event that hashfn(x) == hashfn(y) is a Bernoulli random variable with parameter equal to the theoretical collision probability between x and y given their similarity and the choice of hash function. With this in mind, we could compute an r% confidence interval for the observed collision probability in one of two ways:
Model collision frequency as a binomial random variable: given N hash functions, the number of times that hashfn(x) == hashfn(y) is a binomial random variable. With that knowledge we can directly compute an r% confidence interval for hash collision frequencies that would occur under the null hypothesis that the true hash collision rate is p.
Model collision frequency as a normal random variable: if we have a sufficiently large number of hash functions then the distribution of collision frequencies is approximately normal (by Law of Large Numbers). In particular, since a Bernoulli random variable has mean p and variance p(1-p), the distribution of collision frequencies should be approximately normal with mean p and variance p^2(1-p)^2 / N.
The first method is preferable because it's more exact, but the second is probably sufficient and generally easier to compute. In either case, the Distributions.jl package or StatBase.jl package should provide the functionality needed to compute these confidence intervals.
The text was updated successfully, but these errors were encountered:
Problem
There are currently two kinds of tests for ensuring that hashes have a collision probability that's correlated with similarity:
x
,y
, andz
wherex
andy
have high similarity whilex
andz
have low similarity. Check thatx
andy
have a higher hash collision rate thanx
andz
for a large number of randomly sampled hash functions from the relevant LSH family.x
andy
and compute their similarity. Calculate the theoretical probability that a randomly selected hash function from the relevant LSH family experiences a collision betweenx
andy
. Then sample a large number of hash functions and ensure that the collision rate is within some margin of the expected collision rate (generally within±0.05
).The first type of test works fine, especially when writing code for a new hash function or when the similarity / theoretical collision probability is difficult to compute. In general the second type of test is preferable. However, it's a little inaccurate because we should be using a confidence interval instead of a fixed interval like
p±0.05
. With a confidence interval we'd have a better idea of the expected range of values that can be taken on by the hash collision frequencies, justifying the number of trials we perform to ensure that theoretical probability roughly matches observed frequency.Solution
The event that
hashfn(x) == hashfn(y)
is a Bernoulli random variable with parameter equal to the theoretical collision probability betweenx
andy
given their similarity and the choice of hash function. With this in mind, we could compute anr%
confidence interval for the observed collision probability in one of two ways:N
hash functions, the number of times thathashfn(x) == hashfn(y)
is a binomial random variable. With that knowledge we can directly compute anr%
confidence interval for hash collision frequencies that would occur under the null hypothesis that the true hash collision rate isp
.p
and variancep(1-p)
, the distribution of collision frequencies should be approximately normal with meanp
and variancep^2(1-p)^2 / N
.The first method is preferable because it's more exact, but the second is probably sufficient and generally easier to compute. In either case, the
Distributions.jl
package orStatBase.jl
package should provide the functionality needed to compute these confidence intervals.The text was updated successfully, but these errors were encountered: