Skip to content
This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

hashString in SRV record generation might lead to collisions #157

Closed
sttts opened this issue May 13, 2015 · 15 comments
Closed

hashString in SRV record generation might lead to collisions #157

sttts opened this issue May 13, 2015 · 15 comments

Comments

@sttts
Copy link
Contributor

sttts commented May 13, 2015

hashString uses the Fowler–Noll–Vo algorithm (http://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function) to hash the task id to 16 bit numbers. 16 bits are prone to collisions with a few hundred tasks (birth paradox collisions with >50% probability happen with only 300 tasks!).

We have to go to at least 32 bits, but even then we have >1% collision probability for 10000 tasks.

@jdef
Copy link
Contributor

jdef commented May 13, 2015

what about using a 64bit crc?

On Wed, May 13, 2015 at 3:57 AM, Dr. Stefan Schimanski <
[email protected]> wrote:

hashString uses the Fowler–Noll–Vo algorithm (
http://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function) to hash the
task id to 16 bit numbers. 16 bits are prone to collisions with a few
hundred tasks (birth paradox collisions with >50% probability happen with
only 300 tasks!).

We have to go to at least 32 bits, but even then we have >1% collision
probability for 10000 tasks.


Reply to this email directly or view it on GitHub
#157.

@sttts
Copy link
Contributor Author

sttts commented May 13, 2015

I guess that's the way to go. Though I question those hash names in general now. From a user point of view I would prefer the task-id (somehow sanitized) in the canonical A-record.

@jdef
Copy link
Contributor

jdef commented May 13, 2015

ok, what if we suffixed the {santized-task-id} with the hash?

On Wed, May 13, 2015 at 8:05 AM, Dr. Stefan Schimanski <
[email protected]> wrote:

I guess that's the way to go. Though I question those hash names in
general now. From a user point of view I would prefer the task-id (somehow
sanitized) in the canonical A-record.


Reply to this email directly or view it on GitHub
#157 (comment)
.

@sttts
Copy link
Contributor Author

sttts commented May 13, 2015

what's the sense of the hash then?

@sttts
Copy link
Contributor Author

sttts commented May 13, 2015

uniqueness comes from the framework id + task id

@jdef
Copy link
Contributor

jdef commented May 13, 2015

there was a case that christos had run into where he wanted/needed more for
uniqueness. maybe because he wasn't already using framework-id (and was
using framework-name instead)? trying to remember the details..

On Wed, May 13, 2015 at 8:34 AM, Dr. Stefan Schimanski <
[email protected]> wrote:

uniqueness comes from the framework id + task id


Reply to this email directly or view it on GitHub
#157 (comment)
.

@sttts
Copy link
Contributor Author

sttts commented May 13, 2015

/cc @kozyraki

@tsenart tsenart added the bug label Aug 11, 2015
@tsenart tsenart modified the milestone: v1.0.0 Oct 6, 2015
@sargun
Copy link
Contributor

sargun commented Nov 12, 2015

What do you think about taking a sha1 hash, encoding it to the extended hex alphabet, and truncating it to 6 characters [3 bytes]? The process of generating these hashes is done "offline" in records/generator.go, so I'm not overly concerned about the performance on sha1. Especially, given that modern CPUs can do thousands of hashes a second, and the typical reload time is under a second.

Trivially testing this with the in-built test with 1e9 entries shows no obvious collisions.

@jdef
Copy link
Contributor

jdef commented Nov 12, 2015

I think we need to understand the scope of this uniqueness requirement a bit more before proceeding. That said, here's my two cents:

  • we don't need to pay the expense of SHA1 for hashing, there are cheaper hashing algs that would serve well. we don't need anything that's crypto secure. that's not the point at all. record generation is done offline, right now, but won't be long term. let's not pick an expensive algorithm up front when we don't need it.
  • using 24 bits out of a 160 bit hash defeats the purpose of crypto hashing; you're not using the hash result in a way that's anywhere close to it's original intent (too severely limiting the resulting bit space)
  • something that's FNV-based is fine since we can mitigate both sticky-state and diffusion

https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

I'd like to propose that we use FNV-64 and XOR-fold the results (to mitigate the diffusion problem). Sticky-state should not be a problem for us since we'll never feed zero's into the hash. This gives us a reasonable bit space (32 bits), one that's not artificially constrained and that, once encoded, won't be too long to splice into a record name. And it's fast.

I'd also rather see us go with a zbase-32 encoding: it's less "verbose" than extended-hex encoding and if you ever had to type it in zbase-32 was designed to minimize human error w/ respect to transcription.

@sargun
Copy link
Contributor

sargun commented Nov 13, 2015

So, 4-byte FNV64A + xor-fold doesn't have any collisions using 1M samples. Similarly, a 24-bit SHA1 doesn't have any collisions in this sample I ran. At 10M samples, they both showed collisions. In the future, I'd like to tune the quickcheck test to use a deterministic dataset (set the seed manually).

Benchmarks:

BenchmarkHashFNV-8       5000000               273 ns/op
BenchmarkHashSHA-8       2000000               758 ns/op

The benchmark for SHA included hashing the string, converting it to base32 encoding, and truncating.
The benchmark for FNV included hashing the string, converting to bas32 encoding, and folding.

Opinions?

@jdef
Copy link
Contributor

jdef commented Nov 13, 2015

fwiw, regular base32 encoding can leave strange tail chars that probably
don't comply with DNS name rules. hence my suggestion of zbase32. (
https://github.com/tv42/zbase32 ?)

is it worth testing 8-byte FNV-1a 128-bit + xor-fold for comparison (both
in terms of speed and collisions)?

On Fri, Nov 13, 2015 at 12:17 PM, Sargun Dhillon [email protected]
wrote:

So, 4-byte FNV64A + xor-fold doesn't have any collisions using 1M samples.
Similarly, a 24-bit SHA1 doesn't have any collisions in this sample I ran.
At 10M samples, they both showed collisions. In the future, I'd like to
tune the quickcheck test to use a deterministic dataset (set the seed
manually).

Benchmarks:

BenchmarkHashFNV-8 5000000 273 ns/op
BenchmarkHashSHA-8 2000000 758 ns/op

The benchmark for SHA included hashing the string, converting it to base32
encoding, and truncating.
The benchmark for FNV included hashing the string, converting to bas32
encoding, and folding.

Opinions?


Reply to this email directly or view it on GitHub
#157 (comment)
.

@sargun
Copy link
Contributor

sargun commented Nov 13, 2015

The extended hex alphabet is safe for DNS. It uses - instead of = for the trailing chars as well. zbase32 seems reasonable as well. I don't have any strong opinions, as I imagine the cost, and complexity of applying the encoding for either FNV, or SHA is going to be roughly the same cost.

I don't think using 128-bit FNV is a good idea, because it's not included in the Go stdlib. Although, it's fairly trivial to implement our own, or depend on a third-party library, when there are already an excellent selection of hashing algorithms available to us in stdlib, we should take advantage of them, rather than going off the beaten path.

@sargun
Copy link
Contributor

sargun commented Nov 17, 2015

So, it seems like you need another byte for FNV64a to get the same likelihood of collisions as SHA1. Given that name length is precious in DNS, and all of this is precomputed, I err to use SHA1. Especially, because there's less code involved. In fact, we can make the hashString function much simpler:

func hashString(s string) string {
    hash := sha1.Sum([]byte(s))
    return zbase32.EncodeToString(hash[:])[:5]
}

@jdef
Copy link
Contributor

jdef commented Nov 17, 2015

ok, sgtm. we can revisit performance if it becomes a problem later

@sargun
Copy link
Contributor

sargun commented Nov 17, 2015

See: #356

@sargun sargun closed this as completed Nov 17, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants