hashString in SRV record generation might lead to collisions #157

sttts · 2015-05-13T07:57:14Z

hashString uses the Fowler–Noll–Vo algorithm (http://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function) to hash the task id to 16 bit numbers. 16 bits are prone to collisions with a few hundred tasks (birth paradox collisions with >50% probability happen with only 300 tasks!).

We have to go to at least 32 bits, but even then we have >1% collision probability for 10000 tasks.

The text was updated successfully, but these errors were encountered:

jdef · 2015-05-13T09:08:54Z

what about using a 64bit crc?

On Wed, May 13, 2015 at 3:57 AM, Dr. Stefan Schimanski <
[email protected]> wrote:

hashString uses the Fowler–Noll–Vo algorithm (
http://en.wikipedia.org/wiki/Fowler–Noll–Vo_hash_function) to hash the
task id to 16 bit numbers. 16 bits are prone to collisions with a few
hundred tasks (birth paradox collisions with >50% probability happen with
only 300 tasks!).

We have to go to at least 32 bits, but even then we have >1% collision
probability for 10000 tasks.

—
Reply to this email directly or view it on GitHub
#157.

sttts · 2015-05-13T12:05:55Z

I guess that's the way to go. Though I question those hash names in general now. From a user point of view I would prefer the task-id (somehow sanitized) in the canonical A-record.

jdef · 2015-05-13T12:32:04Z

ok, what if we suffixed the {santized-task-id} with the hash?

On Wed, May 13, 2015 at 8:05 AM, Dr. Stefan Schimanski <
[email protected]> wrote:

I guess that's the way to go. Though I question those hash names in
general now. From a user point of view I would prefer the task-id (somehow
sanitized) in the canonical A-record.

—
Reply to this email directly or view it on GitHub
#157 (comment)
.

sttts · 2015-05-13T12:34:18Z

what's the sense of the hash then?

sttts · 2015-05-13T12:34:34Z

uniqueness comes from the framework id + task id

jdef · 2015-05-13T12:44:48Z

there was a case that christos had run into where he wanted/needed more for
uniqueness. maybe because he wasn't already using framework-id (and was
using framework-name instead)? trying to remember the details..

On Wed, May 13, 2015 at 8:34 AM, Dr. Stefan Schimanski <
[email protected]> wrote:

uniqueness comes from the framework id + task id

—
Reply to this email directly or view it on GitHub
#157 (comment)
.

sttts · 2015-05-13T12:49:17Z

/cc @kozyraki

sargun · 2015-11-12T08:41:43Z

What do you think about taking a sha1 hash, encoding it to the extended hex alphabet, and truncating it to 6 characters [3 bytes]? The process of generating these hashes is done "offline" in records/generator.go, so I'm not overly concerned about the performance on sha1. Especially, given that modern CPUs can do thousands of hashes a second, and the typical reload time is under a second.

Trivially testing this with the in-built test with 1e9 entries shows no obvious collisions.

jdef · 2015-11-12T14:13:29Z

I think we need to understand the scope of this uniqueness requirement a bit more before proceeding. That said, here's my two cents:

we don't need to pay the expense of SHA1 for hashing, there are cheaper hashing algs that would serve well. we don't need anything that's crypto secure. that's not the point at all. record generation is done offline, right now, but won't be long term. let's not pick an expensive algorithm up front when we don't need it.
using 24 bits out of a 160 bit hash defeats the purpose of crypto hashing; you're not using the hash result in a way that's anywhere close to it's original intent (too severely limiting the resulting bit space)
something that's FNV-based is fine since we can mitigate both sticky-state and diffusion

https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

I'd like to propose that we use FNV-64 and XOR-fold the results (to mitigate the diffusion problem). Sticky-state should not be a problem for us since we'll never feed zero's into the hash. This gives us a reasonable bit space (32 bits), one that's not artificially constrained and that, once encoded, won't be too long to splice into a record name. And it's fast.

I'd also rather see us go with a zbase-32 encoding: it's less "verbose" than extended-hex encoding and if you ever had to type it in zbase-32 was designed to minimize human error w/ respect to transcription.

sargun · 2015-11-13T17:17:07Z

So, 4-byte FNV64A + xor-fold doesn't have any collisions using 1M samples. Similarly, a 24-bit SHA1 doesn't have any collisions in this sample I ran. At 10M samples, they both showed collisions. In the future, I'd like to tune the quickcheck test to use a deterministic dataset (set the seed manually).

Benchmarks:

BenchmarkHashFNV-8       5000000               273 ns/op
BenchmarkHashSHA-8       2000000               758 ns/op

The benchmark for SHA included hashing the string, converting it to base32 encoding, and truncating.
The benchmark for FNV included hashing the string, converting to bas32 encoding, and folding.

Opinions?

jdef · 2015-11-13T17:26:41Z

fwiw, regular base32 encoding can leave strange tail chars that probably
don't comply with DNS name rules. hence my suggestion of zbase32. (
https://github.com/tv42/zbase32 ?)

is it worth testing 8-byte FNV-1a 128-bit + xor-fold for comparison (both
in terms of speed and collisions)?

On Fri, Nov 13, 2015 at 12:17 PM, Sargun Dhillon [email protected]
wrote:

So, 4-byte FNV64A + xor-fold doesn't have any collisions using 1M samples.
Similarly, a 24-bit SHA1 doesn't have any collisions in this sample I ran.
At 10M samples, they both showed collisions. In the future, I'd like to
tune the quickcheck test to use a deterministic dataset (set the seed
manually).

Benchmarks:

BenchmarkHashFNV-8 5000000 273 ns/op
BenchmarkHashSHA-8 2000000 758 ns/op

The benchmark for SHA included hashing the string, converting it to base32
encoding, and truncating.
The benchmark for FNV included hashing the string, converting to bas32
encoding, and folding.

Opinions?

—
Reply to this email directly or view it on GitHub
#157 (comment)
.

sargun · 2015-11-13T18:03:17Z

The extended hex alphabet is safe for DNS. It uses - instead of = for the trailing chars as well. zbase32 seems reasonable as well. I don't have any strong opinions, as I imagine the cost, and complexity of applying the encoding for either FNV, or SHA is going to be roughly the same cost.

I don't think using 128-bit FNV is a good idea, because it's not included in the Go stdlib. Although, it's fairly trivial to implement our own, or depend on a third-party library, when there are already an excellent selection of hashing algorithms available to us in stdlib, we should take advantage of them, rather than going off the beaten path.

sargun · 2015-11-17T00:43:12Z

So, it seems like you need another byte for FNV64a to get the same likelihood of collisions as SHA1. Given that name length is precious in DNS, and all of this is precomputed, I err to use SHA1. Especially, because there's less code involved. In fact, we can make the hashString function much simpler:

func hashString(s string) string {
    hash := sha1.Sum([]byte(s))
    return zbase32.EncodeToString(hash[:])[:5]
}

jdef · 2015-11-17T03:08:30Z

ok, sgtm. we can revisit performance if it becomes a problem later

sargun · 2015-11-17T03:55:02Z

See: #356

sttts mentioned this issue May 13, 2015

WIP: Add DiscoveryInfo and custom domain name support #147

Closed

11 tasks

tsenart added the bug label Aug 11, 2015

tsenart added the stability label Oct 6, 2015

tsenart modified the milestone: v1.0.0 Oct 6, 2015

sargun closed this as completed Nov 17, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hashString in SRV record generation might lead to collisions #157

hashString in SRV record generation might lead to collisions #157

sttts commented May 13, 2015

jdef commented May 13, 2015

sttts commented May 13, 2015

jdef commented May 13, 2015

sttts commented May 13, 2015

sttts commented May 13, 2015

jdef commented May 13, 2015

sttts commented May 13, 2015

sargun commented Nov 12, 2015

jdef commented Nov 12, 2015

sargun commented Nov 13, 2015

jdef commented Nov 13, 2015

sargun commented Nov 13, 2015

sargun commented Nov 17, 2015

jdef commented Nov 17, 2015

sargun commented Nov 17, 2015

hashString in SRV record generation might lead to collisions #157

hashString in SRV record generation might lead to collisions #157

Comments

sttts commented May 13, 2015

jdef commented May 13, 2015

sttts commented May 13, 2015

jdef commented May 13, 2015

sttts commented May 13, 2015

sttts commented May 13, 2015

jdef commented May 13, 2015

sttts commented May 13, 2015

sargun commented Nov 12, 2015

jdef commented Nov 12, 2015

sargun commented Nov 13, 2015

jdef commented Nov 13, 2015

sargun commented Nov 13, 2015

sargun commented Nov 17, 2015

jdef commented Nov 17, 2015

sargun commented Nov 17, 2015